INFERENCE IN ADDITIVELY SEPARABLE MODELS
WITH A HIGH DIMENSIONAL SET OF CONDITIONING VARIABLES

DAMIAN KOZBUR 


Abstract. This paper considers estimation and inference of nonparametric
conditional expectation relations with a high dimensional conditioning
set. Rates of convergence and asymptotic normality are derived for series
estimators for models where conditioning information enters in an
additively separable manner and satisfies sparsity assumptions.
Conditioning information is selected through a model selection procedure
which chooses relevant variables in a manner that generalizes the
post-double selection procedure proposed in [9] to the nonparametric
setting. The proposed method formalizes considerations for trading off
estimation precision with omitted variables bias in a nonparametric
setting. Simulation results demonstrate that the proposed estimator
performs favorably in terms of size of tests and risk properties relative
to other estimation strategies.

Key Words: nonparametric models, high dimensional-sparse regression,
inference under imperfect model selection. JEL Codes: C1.


1. Introduction 

Nonparametric estimation in economic and statistical problems is common
because it is appealing in applications for which functional forms are
unavailable. In many problems, the primary quantities of interest can be
computed from the conditional expectation function of an outcome variable
$y$ given a regressor of interest $x$. Nonparametric methods are often
attractive for estimating such conditional expectations since assuming an
incorrect simple parametric model between the variables of interest will
lead to incorrect inference.

In many econometric models, it is important to take into account
conditioning information, $z$. When $x$ is not randomly assigned,
estimates of partial effects of $x$ on $y$ will often be incorrect if $z$
is ignored while it partly influences both $x$ and $y$. However, if $x$
can be considered as approximately randomly assigned given $z$, then the
conditional expectation of $y$ given $x$ and $z$ can be used to calculate
causal effects of $x$ on $y$ and to evaluate counterfactuals. Therefore,
properly accounting for conditioning information $z$ is of primary
importance.

When conditioning information is important to the problem, it is
necessary to replace the simple objective of learning the conditional mean
function $E[y \mid x] = g(x)$ with the new objective of learning a family
of conditional mean functions

$$E[y \mid x, z] = g_z(x) \qquad (1.1)$$

indexed by $z$. Properly accounting for conditioning information in $z$
may be done in several ways and leaves the researcher with important
modeling decisions. For the sake of illustration, four potential ways to
account for conditioning information are (1) by specifying a partially
linear model $g_z(x) = g(x) + z'\beta$, (2) by specifying an additive model
$g_z(x) = g(x) + h(z)$, (3) by specifying a multiplicative model
$g_z(x) = g(x)h(z)$, or (4) by specifying a fully nonparametric model
$g_z(x) = g(x, z)$. The fully nonparametric model suffers from the curse
of dimensionality even for moderately many covariates, while the partially
linear model may be too rigid and may miss important conditioning
information.

Many applications have potentially large conditioning sets. A large
conditioning set in economics may arise because the researcher wishes to
control for many measured characteristics, like demographics, at the
observation level. In extreme cases, the dimension of $z$ can be large
enough that all four example specifications for $g_z(x)$ listed above
require an infeasible amount of data in order to avoid statistical
overfitting and ensure good inference.

This paper is restricted to studying the partially linear model and the
additive model shown above.^1 Therefore, the interest is the
specialization of model (1.1) to the case

$$E[y \mid x, z] = g(x) + h(z). \qquad (1.2)$$

When the additive model provides a good approximation to the underlying
data generating structure, it is useful since many quantities describing
the relationship of $x$ and $y$ conditional on $z$ can be learned with a
good understanding of $g(x)$ alone. In addition, it provides a clear
description of how the conditional relation between $x$ and $y$ changes as
$z$ changes.

An important structure that has been used in recent econometrics is
approximate sparsity. See for example 0, 0, and [9]. In the context of
this paper, approximate sparsity informally refers to the condition that
the conditional expectation function $g_z(x)$ can be approximated by a
family of functions which depend only on a small (though a priori unknown)
subset of the conditioning information contained in $z$. Sparsity is
useful because, in principle, a researcher can address statistical
overfitting problems by locating and controlling for only the correct
conditioning information. The focus of this paper is on providing a formal
model selection technique over models for the conditioning set $z$ which
retrieves the relevant conditioning information. The retrieval of relevant
conditioning information will be done in such a way that standard
estimation techniques performed after model selection will provide correct
inference for nonparametric partial effects of $x$.

^1 These models have a particular structure which makes them convenient
for the problem of selecting a conditioning set. The two alternative models
are likely to require conditions considerably different than the ones
considered here.

This paper takes model (1.2) as a starting point and assumes that $h(z)$,
a complicated function of many conditioning variables, has sparse
structure. The strategy for identifying a sparse structure is to search for
a small subset of relevant terms within a long series expansion for $h(z)$.
To formalize this, a high dimensional framework, allowing the long series
expansion for $h(z)$ to have more terms than the sample size, is
particularly convenient. Once a simple model for $h(z)$ is found, the focus
returns to the estimation of $g(x)$. The paper contributes to the
nonparametrics literature by establishing rates of convergence of estimates
and asymptotic normality for functionals of $g(x)$ after formal model
selection has been performed to simplify the way the conditioning variable
$z$ affects the conditional expectation relation between $x$ and $y$.

In addition to addressing questions about flexible selection of a
conditioning set in a nonparametric setting, this paper contributes to a
broader program aimed at conducting inference in the context of
high-dimensional models. Statistical methods in high dimensions have been
well developed for the purpose of prediction. Two widely used methods for
estimating high dimensional predictive relationships that are important for
the present paper are Lasso and Post-Lasso. The Lasso is a shrinkage
procedure which estimates regression coefficients by minimizing a loss
function plus a penalty for the size of the coefficients. Post-Lasso fits
an ordinary least squares regression on variables with non-identically-zero
estimated Lasso coefficients. For theoretical and simulation results about
the performance of these two methods, see
[IS] [37], [E] [IS] [I], [2], [E], [E], [E] [H], [E], [20], [23],
[24], [26], [27], [28], [33], [37], [38], [40], [H], [4], [H], [4], among many more.
Regularized estimation buys stability through reduction in estimate
variability at the cost of a modest bias in estimates. Regularized
estimators like the Lasso, where many parameter values are set identically
to zero, also favor parsimony. Recently, several authors have begun the
task of assessing uncertainties or estimation error of model parameter
estimates in a wide variety of models with high dimensional regressors
(see, for example, [5]; [3]; [32]; [6]; [9]; [39]; [H]; and [8]).

Quantifying estimation precision has been shown to be difficult
theoretically (for formal statements, see [31], [25]) because model
selection mistakes and regularization typically bias estimates to the same
order of magnitude (relative to the sample size) as estimation variability.
This paper builds on methodology found in [9] (named Post-Double-Selection)
which gives robust statistical inference for the slope parameter of a
treatment variable $x$ with high-dimensional confounders in the context of
a partially linear model $E[y \mid x, z] = \alpha x + z'\eta$.^2 The method
selects elements of $z$ in two steps: step 1 selects the terms in $z$ that
are most useful for predicting $x$, and step 2 selects the elements of $z$
most useful for predicting $y$. The use of two model selection steps is
motivated partially by the intuition that there are two necessary
conditions for omitted variables bias to occur: an omitted variable exists
which is (1) correlated with the treatment $x$, and (2) correlated with the
outcome $y$. Each selection step addresses one of the two concerns. In
their paper, they prove that under the right regularity conditions, the two
described model selection steps can be used to obtain asymptotically normal
estimates of $\alpha$ and in turn to construct correctly sized confidence
intervals. This paper generalizes the approach from estimating a linear
treatment model to estimating a component in nonparametric additively
separable models. The main technical contribution is providing conditions
under which nonparametric estimates for functionals of $g(x)$ are uniformly
asymptotically normal after model selection on a conditioning set over a
large set of data generating processes.

2. A High Dimensional Additively Separable Model

This section provides an intuitive discussion of the additively separable 
nonparametric model explored in this paper. Recall the additive conditional 
expectation model described in the introduction: 

$$E[y \mid x, z] = g_z(x) = g(x) + h(z).$$

The interest is in recovering the function $g(x)$, which describes the
conditional relationship between the treatment variable of interest, $x$,
and the outcome $y$. The component functions $g$ and $h$ belong to ambient
spaces $g \in \mathcal{G}$, $h \in \mathcal{H}$ which are restricted
sufficiently to allow $g$ and $h$ to be uniquely identified.^3 The function
$h$ and the variable $z$ will be allowed to depend on $n$ to facilitate a
high-dimensional thought experiment for the conditioning set. As an
example, this allows estimation of models of the form
$E[y \mid x, z] = g(x) + z_n'\beta_n$ with $\dim(z_n) \to \infty$.
Dependence on $n$ will be suppressed for ease of notation. The formulation
will allow $x$ and $z$ to share variables to some extent. For instance, the
setup will allow for additive interaction models like those found in
Andrews and Whang (1991), such as the model
$E[y \mid x, z] = g(x) + \gamma \cdot x \cdot z + h(z)$ where $\gamma$ is an
unknown scalar.

The estimation of $(g, h)$ proceeds by a series approximation with a
dictionary that is partially data-dependent. As a review, a standard series
estimator of the conditional expectation function without a conditioning
set, $g(x) = E[y \mid x]$, is obtained with the aid of a dictionary of
transformations $p^K(x) = (p_{1K}(x), \dots, p_{KK}(x))'$. The dictionary
consists of a set of $K$ functions of $x$ with the property that a linear
combination of the $p_{jK}(x)$ can approximate $g$ to an increasing level
of precision that depends on $K$. $K$ is permitted to depend on $n$, and
$p^K(x)$ may include splines, Fourier series, orthogonal polynomials or
other functions which may be useful for approximating $g$. The series
estimator is simple and implemented with standard least squares regression.
Given data $(y_i, x_i)$ for $i = 1, \dots, n$, a series estimator for $g$
takes the form $\hat g(x) = p^K(x)'\hat\beta$ for
$P = [p^K(x_1), \dots, p^K(x_n)]'$, $Y = (y_1, \dots, y_n)'$ and
$\hat\beta = (P'P)^{-1}P'Y$. Traditionally, the number of series terms,
chosen in a way to simultaneously reduce bias and increase precision, must
be small relative to the sample size. Thus the function of interest must be
sufficiently smooth or simple in order for nonparametric estimation to work
well. The econometric theory for nonparametric regression estimation using
an approximating series expansion is well-understood under standard
regularity conditions; see, for example, [36], m, m-

^2 Their formulation is slightly more general because it allows the
equality to be an approximate equality.

^3 For example, it is necessary to require additional conditions like
$g(0) = 0$; otherwise, at best, $g$ and $h$ are identified up to addition
by constants.
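
As an illustration of the least squares formula $\hat\beta = (P'P)^{-1}P'Y$, the following is a minimal sketch of a series estimator in Python with NumPy. The power-series dictionary, the simulated design, and the function names (`power_dictionary`, `series_fit`, `series_predict`) are assumptions made for this example only; splines or other dictionaries could be substituted.

```python
import numpy as np

def power_dictionary(x, K):
    """n x K matrix P with columns 1, x, ..., x^(K-1) (an assumed dictionary)."""
    return np.vander(np.asarray(x, dtype=float), N=K, increasing=True)

def series_fit(x, y, K):
    """Least squares coefficients; equals (P'P)^{-1} P'Y when P has full rank."""
    P = power_dictionary(x, K)
    beta_hat, *_ = np.linalg.lstsq(P, y, rcond=None)
    return beta_hat

def series_predict(x_new, beta_hat):
    """Evaluate g_hat(x) = p(x)' beta_hat at new points."""
    return power_dictionary(x_new, len(beta_hat)) @ beta_hat

# Simulated example (hypothetical design): g(x) = sin(2x) plus noise.
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(-1.0, 1.0, size=n)
y = np.sin(2.0 * x) + 0.1 * rng.standard_normal(n)

beta_hat = series_fit(x, y, K=6)
grid = np.linspace(-0.9, 0.9, 19)
max_err = np.max(np.abs(series_predict(grid, beta_hat) - np.sin(2.0 * grid)))
```

Here `K` plays the role of the number of series terms: a larger `K` reduces approximation bias but increases variance, which is why it must remain small relative to `n`.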

Series estimation is particularly convenient for estimating models of the
form (1.2) because they can be approached using two dictionaries,
$p^K(x)$, $q^L(z)$, consisting of $K$ and $L$ terms which individually
approximate $g(x)$ and $h(z)$. The dictionaries can simply be combined into
one larger dictionary. To describe the estimation procedure in this paper,
suppose such a dictionary,
$(p^K(x), q^L(z)) = (p_{1K}(x), \dots, p_{KK}(x), q_{1L}(z), \dots, q_{LL}(z))$,
compatible with the additively separable decomposition exists and is known.
In what follows, dependence on $K$ and $L$ is suppressed in the notation so
that $p^K(x) = p(x)$ and $q^L(z) = q(z)$.

The two dictionaries differ in nature. The first dictionary, $p(x)$, is
traditional and follows standard conditions imposed on series estimators
(for example, [29]), requiring among other conditions that
$K \to \infty$, $K/n \to 0$. The first dictionary must be chosen to
approximate the function $g(x)$ sufficiently well so that if $h(z)$ were
known exactly, $g(x)$ could be estimated in the traditional nonparametric
way and inferences on functionals of $g(x)$ would be reliable.

The second dictionary, $q(z)$, is afforded much more flexibility. This is
convenient and appropriate when entertaining a high dimensional
conditioning set $z$. When the problem of interest is in recovering and
performing inference for $g(x)$, the second component $h(z)$ may be
considered a nuisance parameter. In particular, this paper will not be
concerned with constructing confidence intervals for $h(z)$, and therefore
the requirements on the magnitude of bias in estimating $h(z)$ will be less
stringent. As a consequence, model selection bias of estimates of $h(z)$,
when done according to the method below, will have negligible impact on the
coverage probabilities of confidence sets. Increased flexibility in
modeling $h(z)$ by allowing $L > n$ can make subsequent inference for
$g(x)$ more robust, but requires additional structure on $q(z)$. The key
additional conditions are sparse approximation conditions. The first
sparsity requirement is that there is a small number of components of
$q(z)$ that adequately approximate the function $h(z)$. The second sparsity
requirement is that information about functions $h \in \mathcal{H}$
conditional on $\mathcal{G}$ can be suitably approximated using a small
number of terms in $q(z)$. The identities of the contributing terms,
however, can be unknown to the researcher a priori.

Aside from estimating an entire component of the conditional expectation
function, $g(x)$ itself, a goal of this paper is to obtain asymptotically
normal estimates of certain functionals of $g(x)$. Given a functional
$a(g)$ that satisfies certain regularity conditions, the model selection
procedure on $q(z)$ will deliver a model such that subsequent plug-in
estimates $a(\hat g)$ will be asymptotically normal around $a(g)$. Such
functionals include integrals of $g$, weighted average derivatives of
$g(x)$, evaluation of $g(x)$ at a point and $\arg\max g(x)$.
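
To make the plug-in idea concrete, the sketch below evaluates three of the functionals just listed for a fitted series $\hat g(x) = p(x)'\hat\beta$. It assumes, purely for illustration, that $p(x)$ is a power series $(1, x, x^2, \dots)$, so that derivatives and integrals of $\hat g$ are available in closed form; the function names and the example coefficient vector are hypothetical.

```python
import numpy as np
from numpy.polynomial import polynomial as npoly

def point_value(beta, x0):
    """a(g) = g(x0) for g(x) = sum_k beta[k] * x**k."""
    return float(npoly.polyval(x0, beta))

def integral(beta, lo, hi):
    """a(g) = integral of g over [lo, hi], via the antiderivative."""
    anti = npoly.polyint(beta)
    return float(npoly.polyval(hi, anti) - npoly.polyval(lo, anti))

def average_derivative(beta, x_sample, weights=None):
    """a(g) = (weighted) average of g'(x) over the sample points."""
    deriv = npoly.polyval(np.asarray(x_sample, dtype=float), npoly.polyder(beta))
    return float(np.average(deriv, weights=weights))

# Hypothetical fitted coefficients: g_hat(x) = x + 0.5 x^2.
beta_hat = np.array([0.0, 1.0, 0.5])
value_at_one = point_value(beta_hat, 1.0)            # g_hat(1)
area = integral(beta_hat, 0.0, 1.0)                  # integral over [0, 1]
avg_slope = average_derivative(beta_hat, [0.0, 1.0]) # mean of g_hat'(x)
```

Each function is a plug-in estimate $a(\hat g)$: the functional is applied to the estimated function rather than to the unknown $g$.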


3. Estimation 

When the number of free parameters is larger than the sample size, model
selection or regularization is necessary. There are a variety of different
model selection techniques available to researchers. A popular approach is
via the Lasso estimator given by [18] and m, which in the context of
regression simultaneously performs regularization and model selection. The
Lasso is used in many areas of science and image processing and has
demonstrated good predictive performance. Lasso allows the estimation of
regression coefficients even when the sample size is smaller than the
number of parameters by adding to the quadratic objective function a
penalty term which mechanically favors regression coefficients that contain
zero elements. By taking advantage of ideas in regularized regression, this
paper demonstrates that quality estimation of $g(x)$ can be attained even
when $K + L$, the effective number of parameters, exceeds the sample size
$n$. Estimation proceeds by a model selection step that effectively reduces
the number of parameters to be estimated. There are many other sensible
candidates for model selection devices in the statistics and econometrics
literature. The appropriate choice of model selection methodology can be
tailored to the application. In addition to the Lasso, variants of Lasso
like the group-Lasso, SCAD (see m), BIC, and AIC are all feasible examples.
In the exposition of the results, the model selection procedure used will
be specifically the Lasso because it is simple and widely used. Section 3.2
below provides a brief review of Lasso methods, especially those that arise
in econometric applications.

Estimation of $E[y \mid x, z]$ will be based on a reduced dictionary
$(\tilde p(x), \tilde q(z))$ comprised of a subset of the series terms in
$p(x)$ and $q(z)$. Because the primary object of interest is $g(x)$, it is
natural to include all terms belonging to $p(x)$ in the reduced dictionary,
giving $\tilde p(x) = p(x)$. Therefore, the main selection step involves
choosing a subset of terms from $q(z)$. Given a model selection procedure
which provides a new reduced dictionary $\tilde q(z)$, containing
$\tilde L < L$ series terms, the post-model selection estimate of $g(x)$ is
defined by

$$\hat g(x) = p(x)'\hat\beta$$

where $[\hat\beta' \ \hat\eta']' := ([P \ \tilde Q]'[P \ \tilde Q])^{-1}[P \ \tilde Q]'Y$.

Since estimation of $h(z)$ is of secondary concern, only the components of
$h(z)$ predictive of $g(x)$ and $y$ need to be estimated. These two
predictive goals will guide the choice of model selection procedure as
described in the upcoming sections. The results will demonstrate that under
standard regularity conditions, the post-model selection estimates give
convergence rates for $g(x)$ which are the same as in classic nonparametric
estimation, as well as asymptotic normality results for plug-in estimators
$a(\hat g)$ of nonlinear functionals $a$ of the underlying conditional
expectation function.

3.1. Nonparametric Post-Double Selection in the Additive Model. 

The main challenge in statistical inference after model selection is in
attaining robustness to model selection errors. When coefficients are small
relative to the sample size (i.e., statistically indistinguishable from
zero), model selection mistakes are unavoidable.^4 When such errors are not
accounted for, subsequent inference has been shown to be potentially
severely misleading. This intuition is formally developed in m and [25].
Offering solutions to this problem is the focus of a number of recent
papers; see, for example, m, [3], m, [6], p, [39], m, [g, and 0.^5 This
section extends the approach of [9] to the nonparametric setting.

Informally, model selection in the additively separable model proceeds in
two steps. The two selection steps are based on the observation that the
functional relation $g(x)$ can be learned with knowledge of the conditional
expectations

$$E[p(x) \mid z] \qquad (3.3)$$

$$E[y \mid z] \qquad (3.4)$$

for a robust enough family of test functions $p(x)$, for instance, smooth
functions with compact support. Equivalently, the relation $g(x)$ can be
learned by projecting out the variable $z$ and working with residuals. In
the additively separable model, the two selection steps are summarized as
follows:

(1) First Stage Model Selection Step - Select those terms in q which are 
relevant for predicting terms in p. 

(2) Reduced Form Model Selection Step - Select those terms in q which 
are relevant for predicting y. 


^4 Under some restrictive conditions, for example beta-min conditions which
constrain nonzero coefficients to have large magnitudes, perfect model
selection can be attained.

^5 Citations are ordered by date of first appearance on arXiv.





To further describe the selection stages, it is convenient to ease notation
by introducing an operator $T$ on functions that belong to $\mathcal{G}$:

$$T\varphi(z) = E[\varphi(x) \mid z].$$

This notation is convenient for understanding the validity of post-double
selection in the additively separable model. The operator $T$ measures
dependence between functions in the ambient spaces $\mathcal{G}$,
$\mathcal{H}$ which house the functions $g$, $h$, and the conditioning is
understood to be on all functions in $\mathcal{H}$.

If the operator $T$ can be suitably well approximated, then the post-double
selection methodology generalizes to the nonparametric additively separable
case. The operator $T$ applied to $\varphi$ will be approximated by a linear
combination of the basis terms $q$, so that $T\varphi(z)$ is approximately
a linear combination of the elements of $q(z)$.

Meanwhile, $Tg$ is approximated with linear combinations of $Tp_k$, $1 \le k \le K$.
The final selected model q consists of the union of terms selected during the 
first stage model selection step and the reduced form model selection step. 
A practical implementation algorithm is provided in Sections.3 

[9] develop and discuss the post-double-selection method in detail for the
partially linear model. They note that including the union of the variables
selected in each variable selection step helps address the issue that model
selection is inherently prone to errors unless stringent assumptions are
made. As noted by [25], the possibility of model selection mistakes
precludes the possibility of valid post-model-selection inference based on
a single Lasso regression within a large class of interesting models. The
chief difficulty arises with covariates whose effects in (3.3) are small
enough that the variables are likely to be missed if only (3.3) is
considered, but which have large effects in (3.4). The exclusion of such
variables, which is likely if variables are selected using only (3.3), may
lead to substantial omitted variables bias.^6 Using both model selection
steps guards against such model selection mistakes and guarantees that the
variables excluded in both model selection steps have a negligible
contribution to omitted variables bias under the conditions listed below.

3.2. Brief overview of Lasso methods. The following description of the
Lasso estimator is a review of the particular implementation given in [3].
Consider the conditional expectation $E[y \mid w] = f(w)$ and assume that
$q(w)$ is an approximating dictionary for the function $f(w)$, so that
$f(w) \approx q(w)'\theta$, with dimension $M = \dim(q(w))$. The Lasso
estimates for $\theta$ and $f(w)$ are defined by

$$\hat\theta \in \arg\min_t \sum_{i=1}^n (y_i - q(w_i)'t)^2 + \lambda \sum_{j=1}^M |\Psi_j t_j|$$

$$\hat f_{\text{Lasso}}(w) = q(w)'\hat\theta$$

^6 The same is true if only (3.4) is used for variable selection, exchanging
the roles of (3.3) and (3.4).

where $\lambda$ and $\Psi_j$ are tuning parameters named the penalty level
and the penalty loadings. [3] provided estimation methodology as well as
results guaranteeing performance for the Lasso estimator under conditions
which are common in econometrics, including heteroskedastic and
non-Gaussian disturbances. Tuning parameters are chosen to balance
regularization and bias considerations.^7 Performance bounds for the Lasso,
including rates at which the estimation errors approach zero, are derived
on what is called the Regularization event. The Regularization event is
defined by $\{\lambda \Psi_j \ge c\,|S_j| \text{ for each } j \le L\}$, for
a fixed constant $c > 1$, where $S_j$ is the partial derivative of the
least squares part of the objective in the $j$th direction. Informally,
under the regularization event, the penalty level is high enough so that
coefficients which cannot be statistically separated from zero are
mechanically set to identically zero as a consequence of the absolute
values in the above objective function. Because performance bounds are
directly proportional to $\lambda$, Lasso can be shown to perform well for
small values of $\lambda$ which are nevertheless large enough so that the
regularization event occurs with high probability.

Lasso performs particularly well relative to some more traditional
regularization schemes (e.g., ridge regression) under sparsity: the
parameter $\theta$ satisfies $|\{j : \theta_j \neq 0\}| \le s$ for some
sequence $s \ll n$. A feature of the Lasso penalty that has granted Lasso
success is that it sets some components of $\hat\theta$ to exactly zero in
many cases. Under general conditions,

$$|\hat I| = |\{j : \hat\theta_j \neq 0\}| \le Cs \quad \text{with probability } 1 - o(1)$$

for a constant $C$ that depends on the problem. The Post-Lasso estimator is
defined as the least squares series estimator that considers only terms
selected by Lasso (i.e., terms with nonzero coefficients):

$$\hat f_{\text{Post-Lasso}}(w) = q(w)'\hat\theta_{\text{Post-Lasso}}, \qquad \hat\theta_{\text{Post-Lasso}} \in \arg\min_{\{t \,:\, t_j = 0,\ \forall j \notin \hat I\}} \sum_{i=1}^n (y_i - q(w_i)'t)^2$$
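
The Lasso and Post-Lasso estimators can be sketched numerically as follows. This is a minimal illustration, not the implementation of [3]: it minimizes a penalized least squares objective of the form $\sum_i (y_i - q(w_i)'t)^2 + \lambda \sum_j |\Psi_j t_j|$ by cyclic coordinate descent (an algorithmic choice made here), uses the sample mean of $y$ as a crude preliminary fit for the loadings, and takes the penalty level $\lambda = 2c\sqrt{n}\,\Phi^{-1}(1 - \gamma/2M)$ with $c = 1.1$ and $\gamma = 1/\log n$. The simulated design and all function names are assumptions for the example.

```python
import numpy as np
from math import log, sqrt
from statistics import NormalDist

def lasso_cd(X, y, lam, loadings, n_iter=500, tol=1e-10):
    """Minimize sum_i (y_i - X_i t)^2 + lam * sum_j |loadings_j * t_j|."""
    n, M = X.shape
    t = np.zeros(M)
    col_ss = (X ** 2).sum(axis=0)
    resid = y.astype(float).copy()          # y - X t at t = 0
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(M):
            if col_ss[j] == 0.0:
                continue
            rho = X[:, j] @ resid + col_ss[j] * t[j]
            thresh = lam * loadings[j] / 2.0
            new = np.sign(rho) * max(abs(rho) - thresh, 0.0) / col_ss[j]
            if new != t[j]:
                resid -= X[:, j] * (new - t[j])
                max_change = max(max_change, abs(new - t[j]))
                t[j] = new
        if max_change < tol:
            break
    return t

def post_lasso(X, y, theta_lasso):
    """Least squares restricted to the terms with nonzero Lasso coefficients."""
    support = np.flatnonzero(theta_lasso)
    theta = np.zeros(X.shape[1])
    if support.size:
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        theta[support] = coef
    return theta, support

# Simulated sparse design (hypothetical): M > n, two relevant terms.
rng = np.random.default_rng(0)
n, M = 400, 500
X = rng.standard_normal((n, M))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * rng.standard_normal(n)

gamma = 1.0 / log(n)
lam = 2.0 * 1.1 * sqrt(n) * NormalDist().inv_cdf(1.0 - gamma / (2.0 * M))
# Crude loadings: the sample mean of y stands in for the unknown f(w_i).
loadings = np.sqrt(((X ** 2) * ((y - y.mean()) ** 2)[:, None]).mean(axis=0))

theta_lasso = lasso_cd(X, y, lam, loadings)
theta_post, support = post_lasso(X, y, theta_lasso)
```

In this design the Lasso coefficients on the two relevant terms are shrunk toward zero by the penalty, while the Post-Lasso refit on the selected support removes that shrinkage.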


3.3. Lasso in post-double selection in the additively separable model. In
this section, Lasso is applied directly to the first and second stage
problems described in Section 3.2. The starting point is constructing an
approximation to the operator $T$. Each component of $p$ is regressed onto
the dictionary $q$, giving an approximation for $Tp_k(x)$ as a linear
combination of elements of $q(z)$ for $1 \le k \le K$. If this can be done
with all $p_k$, for each $1 \le k \le K$, then $T$ applied to a linear
combination of $p(x)$, namely $Tp(x)'\beta$, can also be approximated by a
linear combination of elements of $q$. The estimation can be summarized
with one optimization problem which is equivalent to $K$ separate Lasso
problems. All nonzero components of the solution to the optimization are
collected and included as elements of the refined dictionary $\tilde q$.

^7 For the simple heteroskedastic Lasso above, [3] recommend setting

$$\lambda = 2c\sqrt{n}\,\Phi^{-1}(1 - \gamma/2M), \qquad \Psi_j = \sqrt{\sum_{i=1}^n q_j(w_i)^2 (y_i - f(w_i))^2 / n},$$

with $\gamma \to 0$ sufficiently slowly, and $c > 1$. The choices
$\gamma = \log^{-1} n$ and $c = 1.1$ are acceptable. The exact values
$f(w_i)$ are unobserved, and so a crude preliminary estimate
$\hat f(w_i) = \frac{1}{n}\sum_{i=1}^n y_i$ is used to give
$\hat\Psi_j = \sqrt{\sum_{i=1}^n q_j(w_i)^2 (y_i - \hat f(w_i))^2 / n}$.
Estimates of $\hat f(w_i)$ can be iterated on as suggested by [3]. The
validity of the crude preliminary estimate as well as of the iterative
estimates is detailed in the appendix.


$$\hat\Gamma = \arg\min_{\Gamma} \sum_{k=1}^K \sum_{i=1}^n (p_k(x_i) - q(z_i)'\Gamma_k)^2 + \lambda^{FS} \sum_{k=1}^K \sum_{j=1}^L |\Psi^{FS}_{jk} \Gamma_{kj}|$$

Note that the estimate $\hat\Gamma$ approximates $T$ in the sense that
$(q(z)'\hat\Gamma)\beta$ approximates $Tp(x)'\beta$. The first stage tuning
parameters $\lambda^{FS}$, $\Psi^{FS}_{jk}$ are chosen similarly to the
method outlined above but account for the need to estimate effectively $K$
different regressions. Set

$$\lambda^{FS} = 2c\sqrt{n}\,\Phi^{-1}(1 - \gamma/2KL), \qquad \Psi^{FS}_{jk} = \sqrt{\sum_{i=1}^n q_j(z_i)^2 (p_k(x_i) - Tp_k(x_i))^2 / n}.$$

As before, the $\Psi^{FS}_{jk}$ are not directly observable and so
estimates $\hat\Psi^{FS}_{jk}$ are used in their place. The mechanical
implementation for calculating $\hat\Psi^{FS}_{jk}$ is described in the
appendix. Details for the constants involved in choosing $\lambda$ are also
given in the appendix. The appearance of the $K$ term in $\lambda^{FS}$
ensures that the Lasso model selection works uniformly well over the $K$
different Lassos of the first stage.
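
The effect of the extra factor $K$ on the penalty level can be checked with a short calculation. The sketch below assumes the penalty level has the form $2c\sqrt{n}\,\Phi^{-1}(1 - \gamma/(2 \cdot \text{number of terms}))$ with $c = 1.1$ and $\gamma = 1/\log n$; the function name `penalty_level` and the sample values of $n$, $K$, $L$ are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist

def penalty_level(n, n_terms, c=1.1, gamma=None):
    """2 c sqrt(n) Phi^{-1}(1 - gamma / (2 * n_terms)), gamma = 1/log(n) by default."""
    if gamma is None:
        gamma = 1.0 / log(n)
    return 2.0 * c * sqrt(n) * NormalDist().inv_cdf(1.0 - gamma / (2.0 * n_terms))

n, K, L = 500, 5, 200
lam_single = penalty_level(n, L)           # one Lasso over L candidate terms
lam_first_stage = penalty_level(n, K * L)  # K Lassos over L terms, treated jointly
ratio = lam_first_stage / lam_single
```

Because the normal quantile grows slowly in its argument, replacing $L$ by $KL$ raises the penalty level only modestly, which is the price of controlling selection errors uniformly over the $K$ first stage regressions.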

Running the regression above will yield coefficient estimates of exactly
zero for many of the $\hat\Gamma_{kj}$. For each $1 \le k \le K$ let
$\hat I_k = \{j : \hat\Gamma_{kj} \neq 0\}$. Then the first stage model
selection step selects exactly those terms which belong to the union
$\hat I^{FS} = \hat I_1 \cup \dots \cup \hat I_K$.

The reduced form selection step proceeds after the first stage model
selection step. For this step, let

$$\hat\pi = \arg\min_{\pi} \sum_{i=1}^n (y_i - q(z_i)'\pi)^2 + \lambda^{RF} \sum_{j=1}^L |\Psi^{RF}_j \pi_j|$$

where the reduced form tuning parameters $\lambda^{RF}$, $\Psi^{RF}_j$ are
chosen according to the method outlined above with

$$\lambda^{RF} = 2c\sqrt{n}\,\Phi^{-1}(1 - \gamma/2L), \qquad \Psi^{RF}_j = \sqrt{\sum_{i=1}^n q_j(z_i)^2 (y_i - E[y_i \mid z_i])^2 / n}.$$

Let $\hat I^{RF} = \{j : \hat\pi_j \neq 0\}$ be the outcome of the reduced
form step of model selection.

Consider the set of dictionary terms selected in the first stage and
reduced form model selection steps. Let $\hat I$ be the union of all
selected dictionary terms: $\hat I = \hat I^{FS} \cup \hat I^{RF}$. Then
define the refined dictionary by
$(\tilde p(x), \tilde q(z)) = (p(x), \{q_j(z)\}_{j \in \hat I})$. Let
$[P \ \tilde Q]$ be the $n \times (K + |\hat I|)$ matrix with the
observations of the refined dictionary stacked. The post-double-model
selection estimate for $g(x)$ is defined by

$$\hat g(x) = p(x)'\hat\beta \qquad (3.5)$$

where $[\hat\beta' \ \hat\eta']' := ([P \ \tilde Q]'[P \ \tilde Q])^{-1}[P \ \tilde Q]'Y$.
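
The full procedure (first stage selection, reduced form selection, union, least squares on the refined dictionary) can be sketched end to end as follows. This is an illustrative sketch under stated assumptions rather than the paper's implementation: Lasso is solved by coordinate descent, the penalty loadings use only the crude sample-mean preliminary fit without iteration, each Lasso target is centered before selection, and the simulated design (with $p(x) = (x, x^2)$ and $q(z) = z$) is hypothetical.

```python
import numpy as np
from math import log, sqrt
from statistics import NormalDist

def lasso(X, y, lam, loadings, sweeps=200, tol=1e-8):
    """Coordinate descent for sum (y - X t)^2 + lam * sum_j |loadings_j t_j|."""
    t = np.zeros(X.shape[1])
    ss = (X ** 2).sum(axis=0)
    resid = y.astype(float).copy()
    for _ in range(sweeps):
        biggest = 0.0
        for j in range(X.shape[1]):
            rho = X[:, j] @ resid + ss[j] * t[j]
            new = np.sign(rho) * max(abs(rho) - lam * loadings[j] / 2.0, 0.0) / ss[j]
            resid -= X[:, j] * (new - t[j])
            biggest = max(biggest, abs(new - t[j]))
            t[j] = new
        if biggest < tol:
            break
    return t

def penalty_level(n, n_terms, c=1.1):
    gamma = 1.0 / log(n)
    return 2.0 * c * sqrt(n) * NormalDist().inv_cdf(1.0 - gamma / (2.0 * n_terms))

def select(Q, target, lam):
    """Indices of terms in Q with nonzero Lasso coefficients for a centered target."""
    e = target - target.mean()  # crude preliminary residual
    loadings = np.sqrt(((Q ** 2) * (e ** 2)[:, None]).mean(axis=0))
    return set(np.flatnonzero(lasso(Q, e, lam, loadings)).tolist())

def post_double_selection(P, Q, y):
    n, K = P.shape
    L = Q.shape[1]
    chosen = set()
    for k in range(K):                                   # first stage: p_k on q
        chosen |= select(Q, P[:, k], penalty_level(n, K * L))
    chosen |= select(Q, y, penalty_level(n, L))          # reduced form: y on q
    idx = sorted(chosen)
    design = np.hstack([P, Q[:, idx]])                   # refined dictionary
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[:K], idx

# Hypothetical design: z_0 confounds x and y, z_1 affects y only.
rng = np.random.default_rng(1)
n, L = 400, 150
Z = rng.standard_normal((n, L))
x = 0.8 * Z[:, 0] + 0.6 * rng.standard_normal(n)
y = x + 0.5 * x ** 2 + Z[:, 0] + Z[:, 1] + 0.5 * rng.standard_normal(n)
P = np.column_stack([x, x ** 2])                         # g(x) = x + 0.5 x^2

beta_hat, selected = post_double_selection(P, Z, y)
```

In this design the confounder `Z[:, 0]` is strongly predictive of `x`, so the first stage step is what guarantees its inclusion; omitting it would bias the coefficient on `x`.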


4. Regularity and Approximation Conditions 


In this section, the model described above is written formally and
conditions guaranteeing convergence and asymptotic normality of the
Post-Double Selection Series Estimator are given.

Assumption 1. (i) $(y_i, x_i, z_i)$ are i.i.d. random variables and satisfy
$E[y_i \mid x_i, z_i] = g(x_i) + h(z_i)$ with $g \in \mathcal{G}$ and
$h \in \mathcal{H}$ for pre-specified classes of functions $\mathcal{G}$,
$\mathcal{H}$.

The first assumption specifies the model. The observations are required to
be identically distributed, which is stronger than the treatment of i.n.i.d.
variables given in Belloni, Chernozhukov and Hansen (2011).

4.1. Regularity and approximation conditions concerning the first
dictionary. The following few definitions help characterize smoothness
properties of the target function $g$ and approximating functions $p$. Let
$f$ be a function defined on the support $\mathcal{X}$ of $x$. Define the
Sobolev norm
$|f|_d = \sup_{x \in \mathcal{X}} \max_{|a| \le d} |\partial^{|a|} f(x)/\partial x^a|$.
In addition, let
$\zeta_d(K) = \max_{|a| \le d} \sup_{x \in \mathcal{X}} \|\partial^{|a|} p(x)/\partial x^a\|$,
where $\|\cdot\|$ denotes the Euclidean norm. Throughout the exposition,
all assumptions will be required to hold for each $n$ with the same set of
implied constants.

Assumption 2. There is an integer $d \ge 0$, a real number $\alpha > 0$,
and vectors $\beta = \beta_K$ such that $\|\beta\| = O(1)$ and
$|g - p'\beta|_d = O(K^{-\alpha})$ as $K \to \infty$.

Assumption 2 is standard in nonparametric estimation. It requires that the
dictionary $p$ can approximate $g$ at a pre-specified rate. Values of $d$
and $\alpha$ can be derived for particular classes of functions. [29] gives
approximation rates for several leading examples, for instance orthogonal
polynomials, regression splines, etc.

Assumption 3. For each $K$, the smallest eigenvalue of the matrix

$$E[(p(x) - Tp(x)(z))(p(x) - Tp(x)(z))']$$

is bounded uniformly away from zero in $K$. In addition, there is a
sequence of constants $\zeta_0(K)$ satisfying
$\sup_{x \in \mathcal{X}} \|p(x)\| \le \zeta_0(K)$ and
$\zeta_0(K)^2 K/n \to 0$ as $n \to \infty$.

This condition is a direct analogue of a combination of Assumption 2 from
Newey (1997) and the necessary and sufficient conditions for estimation of
partially linear models from [32]. Requiring the eigenvalues of
$E[(p(x) - Tp(x)(z))(p(x) - Tp(x)(z))']$ to be uniformly bounded away from
zero is effectively an identifiability condition. It is an analogue of the
standard condition that $E[p(x)p(x)']$ have eigenvalues bounded away from
zero, specialized to the residuals of $p(x)$ after conditioning on $z$. The
second condition of Assumption 3 is a standard regularity condition on the
first dictionary.
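
A rough numerical reading of the eigenvalue condition: when $x$ is almost a function of $z$, little variation in $p(x)$ remains after conditioning on $z$, and the smallest eigenvalue approaches zero. The sketch below computes a sample analogue in a hypothetical Gaussian design, replacing $Tp(x)$ by a least squares projection of $p(x) = (x, x^2)$ on $(1, z, z^2)$; this projection and the function names are assumptions of the example.

```python
import numpy as np

def min_eig_residualized(P, Q):
    """Smallest eigenvalue of the sample second-moment matrix of p(x) - Tp(x),
    with Tp(x) replaced by the least squares projection of P on Q."""
    coef, *_ = np.linalg.lstsq(Q, P, rcond=None)
    R = P - Q @ coef
    return float(np.linalg.eigvalsh(R.T @ R / len(R)).min())

def simulate(rho, n=2000, seed=0):
    """x = rho * z + sqrt(1 - rho^2) * v with z, v independent standard normal."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    x = rho * z + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(n)
    P = np.column_stack([x, x ** 2])
    Q = np.column_stack([np.ones(n), z, z ** 2])
    return min_eig_residualized(P, Q)

weak_dependence = simulate(rho=0.5)     # x retains much independent variation
strong_dependence = simulate(rho=0.95)  # x is nearly determined by z
```

In this design the smallest eigenvalue is approximately $1 - \rho^2$, so it shrinks toward zero as the dependence between $x$ and $z$ strengthens.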

4.2. Sparsity Conditions. The next assumptions concern sparsity properties
surrounding the second dictionary $q(z)$, used for approximating $h(z)$. As
outlined above, sparsity will be required along two dimensions in the
second dictionary: both with respect to the outcome equation (1) and with
respect to the operator $T$. Consider a sequence $s = s_n$ that controls
the number of nonzero coefficients in a vector. A vector $X$ is $s$-sparse
if $|\{j : X_j \neq 0\}| \le s$. The following give formal restrictions
regarding the sparsity of the outcome equation relative to the second
approximating dictionary as well as a sparse approximation of the operator
$T$ described above.

Assumption 4. Sparsity Conditions: there is a sequence $s = s_n$ and
$\phi = s \log(\max\{KL, n\})$ such that

(i) Approximate sparsity in the outcome equation: there is a sequence of
vectors $\eta = \eta_L$ that are $s$-sparse and the approximation
$\sqrt{\sum_{i=1}^n (h(z_i) - q(z_i)'\eta)^2 / n} = O_P(\sqrt{\phi/n})$
holds. In addition, $\max_{i \le n} |h(z_i) - q(z_i)'\eta| = o_P(1)$.

(ii) Approximate sparsity in the first stage: there are $s$-sparse
$\Gamma_k = \Gamma_{k,L}$ such that
$\max_{k \le K} \sqrt{\sum_{i=1}^n (E[p_k(x_i) \mid z_i] - q(z_i)'\Gamma_k)^2 / n} = O_P(\sqrt{\phi/n})$.
In addition,
$\max_{i \le n,\, k \le K} |E[p_k(x_i) \mid z_i] - q(z_i)'\Gamma_k| = o_P(1)$.

(iii) $s = o(n)$.

The assumption above imposes only a mild condition on the sparsity s and
in a sense may be thought of as definitional. In the discussion that follows,
additional conditions on the size of the sparsity level s will be imposed. As a
preview, the conditions listed in Assumption 7 will require that
$K \phi n^{-1/2} = K s \log(\max\{KL, n\})\, n^{-1/2} \to 0$, among other conditions.

The first statement requires that the second dictionary can approximate
h using a small number of terms. The average squared approximation error
from using a sparse $\eta$ must be smaller than the conjectured estimation error
when the subset of the correct terms is known. This restriction on the
approximation error follows the convention used by [3]. The second restriction
on the maximum approximation error is used to simplify the proofs. The
second statement of Assumption 4 generalizes the first approximate sparsity
requirement. It requires that each component of the dictionary p can be
approximated by a linear combination of a small set of terms in q.

Additional discussion of the sparsity assumptions is given in Section
6, which addresses issues arising in the implementation of non-parametric
post-double estimates.

4.3. Regularity conditions concerning the second dictionary. The
following conditions restrict the sample Gram matrix of the second
dictionary. A standard condition for nonparametric estimation is that for a
dictionary P, the Gram matrix P'P/n eventually has eigenvalues bounded away
from zero uniformly in n with high probability. If K + L > n, then the
matrix [P Q]'[P Q]/n will be rank deficient. However, in the high-dimensional
setting, to assure good performance of Lasso, it is sufficient to only
control certain moduli of continuity of the empirical Gram matrix. There are
multiple formalizations of moduli of continuity that are useful in different
settings; see [10], [33] for explicit examples. This paper focuses on a simple
condition that seems appropriate for econometric applications. In
particular, the assumption that only small submatrices of Q'Q/n have
well-behaved eigenvalues will be sufficient for the results that follow. In
the sparse setting, it is convenient to define the following sparse
eigenvalues of a positive semi-definite matrix M:


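$$\varphi_{\min}(m)(M) := \min_{1 \le \|\delta\|_0 \le m} \frac{\delta' M \delta}{\|\delta\|^2}, \qquad \varphi_{\max}(m)(M) := \max_{1 \le \|\delta\|_0 \le m} \frac{\delta' M \delta}{\|\delta\|^2}.$$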

In this paper, favorable behavior of sparse eigenvalues is taken as a high 
level condition and the following is imposed. 

Assumption 5. For every constant C > 0 there are constants
$\kappa'' > \kappa' > 0$, which may depend on C, such that with probability
approaching 1 the sparse eigenvalues obey

$$\kappa' \le \varphi_{\min}(CsK)(Q'Q/n) \le \varphi_{\max}(CsK)(Q'Q/n) \le \kappa''.$$

Assumption 5 requires only that certain "small" CsK x CsK submatrices
of the large p x p empirical Gram matrix Q'Q/n are well-behaved. This
condition seems reasonable and will be sufficient for the results that
follow. Informally it states that no small subset of covariates in q suffers
a multicollinearity problem. This could be shown to hold under more
primitive conditions by adapting arguments found in [3], which build upon
results in m and [35]; see also [34].
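
For intuition, the sparse eigenvalues defined above can be computed by brute
force when the dimension is small. The following is a minimal sketch in
Python and is not part of the estimation procedure; the function name
`sparse_eigenvalues` is hypothetical, and enumerating all supports is only
feasible for small matrices.

```python
from itertools import combinations

import numpy as np


def sparse_eigenvalues(M, m):
    """Return (min, max) m-sparse eigenvalues of a symmetric PSD matrix M.

    By eigenvalue interlacing, the extremes over supports of size at most m
    are attained on supports of size exactly m, so only those are enumerated.
    """
    p = M.shape[0]
    lo, hi = np.inf, -np.inf
    for support in combinations(range(p), m):
        sub = M[np.ix_(support, support)]
        eig = np.linalg.eigvalsh(sub)  # eigenvalues in ascending order
        lo, hi = min(lo, eig[0]), max(hi, eig[-1])
    return lo, hi
```

When L > n the full matrix Q'Q/n is singular, yet the value returned for a
small m can remain bounded away from zero, which is the situation that
Assumption 5 describes.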
4.4. Moment Conditions. The next conditions are high level conditions
about moments and the convergence of certain sample averages which ensure
good performance of the Lasso as a model selection device. They allow the
use of moderate deviation results given in [22], which ensure good
performance of Lasso under non-Gaussian and heteroskedastic errors. [9] discuss
the plausibility of these types of moment conditions for various models for
the case K = 1. For common approximating dictionaries for a single variable,
the condition can be readily checked in a similar manner.

Assumption 6. Set e = y — g{x) — h{z). For each k ^ K let 
Wk = Pk{x) - Tpk{x) and define zufjf = {E [\qj{zi)Wik\^]Y^^ , = 

(yE \^qj{zi)ei\^'\') ' . Let c,C be constants that do not depend on n. The 
following conditions are satisfied with probability 1 — o(l) for each 1 ^ k ^ 
A, 1 < j < L,: 

(z) c < E[q,{z,)h%E[qj{zifwl] < C 

(a) 1 ^ maxTf^/minTf^,maxTf//minTf;f ^ C 

{Hi) 1 ^ nmx , n^ '^f^l \J ^ C 

(iv) log^(iLL) = o(n) and slog{KL) = o(l) 

,, I'i'S" - \/e(5Wi ur - \/iTF)"i 

(u) maxmax- — -,max- — -= o(l) 

k^K j^L yjEi^^Sy j^L yjE{^f^)^ 

{vi) ^ Tjf ^ ^ ^ 

4.5. Global Convergence. The first result is a preliminary result which
gives bounds on convergence rates for the estimator $\hat g$. They are used in the
course of the proof of Theorem 1 below, the main inferential result of this
paper. The proposition is a direct analogue of the rates given in Theorem
1 of [29], which considers estimation of a conditional expectation g without
model selection over a conditioning set. The rates obtained in Proposition
1 match the rates in [29].

Proposition 1. Under assumptions listed above, the post-double-model- 
selection estimates for the function g given in equation lg.5l satisfy 

$$\int (\hat g(x) - g(x))^2 \, dF(x) = O_P(K/n + K^{-2\alpha}),$$

$$|\hat g - g|_d = O_P\left(\zeta_d(K)\left[\sqrt{K}/\sqrt{n} + K^{-\alpha}\right]\right).$$

5. Inference and asymptotic normality 

In this section, formal results concerning inference are stated. Consider 
estimation of a functional $a$ on the class of functions $\mathcal{G}$. The quantity of
interest, $\theta = a(g)$, is estimated by

$\hat\theta = a(\hat g)$.

The following assumptions on the functional a are imposed. They are
regularity assumptions that imply that a attains a certain degree of
smoothness. For example, they imply that a is Frechet differentiable.

Assumption 7. Either (i) a is linear over $\mathcal{G}$, or (ii) for d as in
Assumption 2, $\zeta_d(K)^4 K^2/n \to 0$. In addition, there is a function
$D(f; \tilde f)$ that is linear in f and such that for some constants
$C, \nu > 0$ and all $\tilde f, \bar f$ with $|\tilde f - g|_d < \nu$,
$|\bar f - g|_d < \nu$, it holds that
$\|a(f) - a(\tilde f) - D(f - \tilde f; \tilde f)\| \le C(|f - \tilde f|_d)^2$ and
$\|D(f; \tilde f) - D(f; \bar f)\| \le L |f|_d |\tilde f - \bar f|_d$.

The function D is related to the functional derivative of a. The following
assumption imposes further regularity on the continuity of the derivative.
For shorthand, let $D(g) = D(g; g_0)$.

Assumption 8. Either (i) a is scalar, $|D(g)| \le C|g|_d$, and there is
$\beta_K$ dependent on K such that for $g_K(x) = p(x)'\beta_K$, it holds that
$E[g_K(x)^2] \to 0$ and $D(g_K) \ge C > 0$; or (ii) there is $v(x)$ with
$E[v(x)v(x)']$ finite and nonsingular with $D(g) = E[v(x)g(x)]$ and
$D(p_k) = E[v(x)p_k(x)]$ for every k, and there is $\beta$ so that
$E[\|v(x) - p(x)'\beta\|^2] \to 0$.

In order to use $\hat\theta$ for inference on $\theta$, an approximate expression
for the variance $\mathrm{var}(\hat\theta)$ is necessary. As is standard, the
expression for the variance will be approximated using the delta method. An
approximate expression for the variance of the estimator $\hat\theta$ therefore
requires an appropriate derivative of the function a (or rather, an estimate
of it). Let A denote the derivatives of the functions belonging to the
approximating dictionary, $A = (D(p_1), \ldots, D(p_K))'$. Let $\hat A$ denote
an estimate of A. The approximate variance, from the delta method, is
given by $V = V_K$:

$$V = A'\Omega^{-1}\Sigma\Omega^{-1}A$$

$$\Omega = E[(p(x) - Tp(x))(p(x) - Tp(x))']$$

$$\Sigma = E[(p(x) - Tp(x))(p(x) - Tp(x))'(y - g(x))^2]$$


These quantities are unobserved but can be estimated:

$$\hat V = \hat A'\hat\Omega^{-1}\hat\Sigma\hat\Omega^{-1}\hat A$$

$$\hat\Omega = \sum_{i=1}^n (p(x_i) - \hat p(x_i))(p(x_i) - \hat p(x_i))'/n$$

$$\hat\Sigma = \sum_{i=1}^n (p(x_i) - \hat p(x_i))(p(x_i) - \hat p(x_i))'(y_i - \hat g(x_i))^2/n$$

The elements $\hat p_k(x_i)$ are obtained as the predictions from the least squares
regression of $p_k(x_i)$ onto the selected $q(z_i)$. Then $\hat V$ is used as an
estimator of the asymptotic variance of $\hat\theta$ and assumes a sandwich form.
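
The sandwich form can be illustrated with a short numerical sketch. The
Python function below is a hypothetical helper, not the author's code: it
assumes a scalar functional, takes the residuals and the estimated
derivative vector as inputs supplied by the caller, and computes the
plug-in quantities from the selected terms of the second dictionary.

```python
import numpy as np


def sandwich_variance(P, Q_sel, resid, A):
    """Plug-in sandwich estimate of V for a scalar functional.

    P      : (n, K) first dictionary p(x_i)
    Q_sel  : (n, s) selected terms of the second dictionary q(z_i)
    resid  : (n,)   residuals from the post-selection fit (caller-supplied)
    A      : (K,)   estimated derivative vector of the functional
    """
    n = P.shape[0]
    # p_hat_k(x_i): least squares predictions of each p_k(x_i) from Q_sel
    coef, *_ = np.linalg.lstsq(Q_sel, P, rcond=None)
    W = P - Q_sel @ coef
    Omega_hat = W.T @ W / n
    Sigma_hat = (W * resid[:, None] ** 2).T @ W / n
    Omega_inv = np.linalg.inv(Omega_hat)
    return float(A @ Omega_inv @ Sigma_hat @ Omega_inv @ A)
```

Under the normalization used in Theorem 1, a standard error for the
estimated functional would then be the square root of the returned value
divided by n.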

The following assumptions are needed in order to bound $\hat V - V$:

Assumption 9. Define the moments = {E [|gj(2:j)ejp]) ^ = 

[E Let q > 4,0 > 0 be constants. Then for k ^ K, j L 

(i) $E[|\epsilon_i|^q \mid x_i, z_i] \le C$


(ii) wfj^ElqfiziYe^w^yE[Wle‘f] ^ C 
(in) = o(l) 


{iv) 





max\qj{zi)\ = op(l) 


The assumption that q > 4 moments of $\epsilon$ are bounded slightly strengthens
the condition in [29] that fourth moments are bounded. The condition is
necessary for the estimation of the final standard errors. Similarly, condition
(iii) is stronger than the rate condition listed in Assumption 3. The more
stringent condition is also useful for estimating standard errors. Finally,
condition (ii) is analogous to Assumption 7, condition (iii) and is again
useful for controlling the tail behavior of certain self-normalized sums.

The next result is the main result of the paper. It establishes the validity
of standard inference procedures after model selection as well as the validity
of the plug-in variance estimator.

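Theorem 1. Under Assumptions 1-7, 9 and Assumption 8(i), if in
addition $\sqrt{n}K^{-\alpha} \to 0$, then
$\hat\theta = \theta + O_P(\zeta_d(K)/\sqrt{n})$ and

$$\sqrt{n}\,V^{-1/2}(\hat\theta - \theta) \to_d N(0,1), \qquad \sqrt{n}\,\hat V^{-1/2}(\hat\theta - \theta) \to_d N(0,1).$$

If Assumptions 1-7, 9 and Assumption 8(ii) hold with d = 0 and in
addition $\sqrt{n}K^{-\alpha} \to 0$, then for $V = E[v(x)v(x)'\mathrm{var}(y|x)]$,
the following convergences hold:

$$\sqrt{n}(\hat\theta - \theta) \to_d N(0, V), \qquad \|\hat V - V\| \to_p 0.$$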

The theorem shows that the outlined procedure gives a valid method for
performing inference for functionals after selection of series terms. Note that
under Assumption 8(i) the $\sqrt{n}$ rate is not achieved because the functional a
does not have a mean square continuous derivative. By contrast, Assumption
8(ii) is sufficient for $\sqrt{n}$-consistency. Conditions under which the particular
assumptions regarding the approximation of g hold are well known. For
example, conditions on K for various common approximating dictionaries,
including power series or regression splines, follow those directly derived
in [29]. Asymptotic normality of these types of estimates under the high
dimensional additively separable setting should therefore be viewed as a
corollary to the above result.

Consider one example with the functional of interest being evaluation of g
at a point $x_0$: $a(g) = g(x_0)$. In this case, a is linear and $D(g) = g(x_0)$ for all
functions g. This particular example does not attain a $\sqrt{n}$ convergence rate
provided there is a sequence of functions $g_K$ in the linear span of $p = p^K$
such that $E[g_K(x)^2]$ converges to zero but $g_K(x_0)$ is positive for each K.
Another example is the weighted average derivative
$a(g) = \int w(x)\,\partial g(x)/\partial x\,dx$
for a weight function w which satisfies regularity conditions. For example,
the theorem holds if w is differentiable, vanishes outside a compact set,
and the density of x is bounded away from zero wherever w is positive. In
this case, $a(g) = E[v(x)g(x)]$ for $v(x) = -f(x)^{-1}\,\partial w(x)/\partial x$ by
integration by parts, provided that x is continuously distributed with
nonvanishing density f. This is one possible set of sufficient conditions
under which the weighted average derivative does achieve $\sqrt{n}$-consistency.
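
The representation of the weighted average derivative as an expectation can
be checked in one line for scalar x. The following derivation is a sketch
under the stated conditions on w and f; the boundary term vanishes because
w is zero outside a compact set:

$$a(g) = \int w(x)\,\frac{\partial g(x)}{\partial x}\,dx = -\int \frac{\partial w(x)}{\partial x}\,g(x)\,dx = \int \left\{-f(x)^{-1}\frac{\partial w(x)}{\partial x}\right\} g(x)\,f(x)\,dx = E[v(x)g(x)].$$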

6. Additional discussion of implementation

The theorem above states that for a fixed (nonrandom) sequence $K = K_n$,
asymptotically normal estimates are achieved for certain functionals of
interest of g(x) under the right regularity conditions. In practice, the choice
of $K_n$ is important and it is useful to have a data-driven means by which
to choose $K_n = \hat K = \hat K(\{(x_i, y_i)\}_{i=1}^n)$. This section provides
suggestions for choosing such $\hat K$. This paper leaves these suggestions as
heuristics and does not derive formal theory for their asymptotic
performance; however, the finite sample performance of these heuristics is
explored in the simulations in Section 7.

Suppose that candidates for K belong to the integer set
$\{\underline{K}, \ldots, \bar{K}\}$. Suppose that $L = L_n$ is nonrandom as
before and does not vary with $K \in \{\underline{K}, \ldots, \bar{K}\}$. A simple
proposal is as follows: for each K, construct, using the post-double model
selection routine, a reduced second dictionary $\tilde q^K$. This results in a
set of selected dictionaries which are candidates for a final estimation step:

$$\{(p^{\underline{K}}, \tilde q^{\underline{K}}), \ldots, (p^{\bar{K}}, \tilde q^{\bar{K}})\}.$$

Then $\hat K$ can be chosen by selecting from the dictionaries in the above set.
In principle this can be done in many ways, including cross validation or
BIC, possibly with additional over-smoothing (i.e. choosing K larger than,
say, the mean-square error optimal value).

There is also the question of whether the penalty levels in the Lasso
optimizations can be chosen in a more data-driven way. There is less flexibility in these
choices, since the Lasso bounds are predicated on the Regularization event. Using a
smaller penalty level than suggested leads to over-selection of control variables, which can
bias estimates. Further discussion of the effects of post-model-selection inference with
over-selection can be found in [7].
The informal reasoning behind this proposal is that for each K, the
post-double model selection selects confounders in a way so that the omitted
variables bias resulting from possibly omitting covariates predictive of
elements of $p^K(x)$ is small relative to sampling variability. But there can be
another component of omitted variables bias: namely, the omitted variables
bias resulting from excluding all confounders predictive only of signal in x
which is not accounted for by p(x). However, this second component of
omitted variables bias will plausibly be small if K is chosen appropriately
(i.e. whenever $\sqrt{n}K^{-\alpha} \to 0$, a similar bound on such omitted variables
bias may be expected to hold).

In addition to issues arising from choosing K, the sparsity conditions 
imposed in the previous sections are restrictive. However, without such a 
sparse structure, it is difficult to construct meaningful estimates of g{x). 
Furthermore, at the current time, there are no widely used procedures to 
test the null hypothesis of a sparse model that the author is aware of. 

A major restriction implicit in the sparsity assumption is that the sparse
approximation errors are small for all terms in p(x). For instance, if
$p(x) = (x, x^2, x^3, \ldots)$, it is much less demanding to ask that there
exists a good sparse predictor of x based on q(z) than it is to ask that
$x, x^2, x^3, \ldots$ all have quality sparse predictors. On the other hand, it
is possible that the transformations $x^2 + x^3$ or $x^2 - x^3$ might have much
better sparse representations in terms of q(z) than $x^2$ and $x^3$ have
individually.

Therefore, an alternative strategy for the first stage is to have a model
selection step for many distinct linear combinations $a'p(x)$ given by
$a \in \mathcal{A} \subset \mathbb{R}^K$. The strategy is outlined as follows.
First, gather the set $\{a'p(x) : a \in \mathcal{A} \subset \mathbb{R}^K\}$ into an
extended first stage dictionary, $p_{FS}(x)$. Select a reduced conditioning
dictionary $\tilde q(z)$ with the nonparametric post-double selection
method described above, except using $p_{FS}(x)$ in the first stage model
selection step. Finally, in the post-model-selection estimation step, estimate
using $(p(x), \tilde q(z))$. This strategy is potentially useful since it further
reduces the possibility of omitted variables bias. A clear tradeoff with using a
distinct first stage dictionary is that, due to the additional model selection
steps introduced, more variables from q(z) can potentially be selected, leading
to higher variability of the final estimate of g(x).
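
The three steps of this strategy can be sketched in a few lines of Python.
The code below is a simplified illustration and not the author's
implementation: the Lasso is solved by a plain coordinate descent, the
penalty is an assumed simple plug-in rule rather than the penalty levels and
loadings used in the paper, and the function names `lasso_cd` and
`post_double_selection` are hypothetical.

```python
import numpy as np


def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float)
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * beta[j]  # partial residual without term j
            rho = X[:, j] @ resid / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]  # restore the full residual
    return beta


def post_double_selection(y, P, Q, P_fs=None, c=1.1):
    """Select columns of Q that predict y or any first stage target,
    then regress y on [P, selected Q] by least squares."""
    P_fs = P if P_fs is None else P_fs
    n, L = Q.shape
    targets = [y] + [P_fs[:, k] for k in range(P_fs.shape[1])]
    selected = set()
    for t in targets:
        # ASSUMPTION: simple plug-in penalty, not the paper's penalty rule
        lam = c * t.std() * np.sqrt(2.0 * np.log(max(L, n)) / n)
        selected |= set(np.flatnonzero(lasso_cd(Q, t - t.mean(), lam)))
    sel = sorted(selected)
    X = np.hstack([P, Q[:, sel]])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[: P.shape[1]], sel
```

Passing an extended dictionary as `P_fs` changes only the selection step;
the final least squares step still uses the original dictionary `P`, which
mirrors the tradeoff described above.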

As with the data-driven choice of K, this suggestion is kept at the level
of a heuristic at this moment. However, arguments in the proofs of the
main results can easily be extended to allow a first
stage dictionary $p_{FS}(x)$ provided that it has a sub-dictionary, $p(x)$, for
which Assumptions 1-4 hold, and that the number of selected conditioning
variables $\tilde q(z)$ remains $O(sK)$ with high probability. For example, if
$\dim(p_{FS}(x)) \le C \dim(p(x))$ then no substantive modifications to the proof
are necessary.

The finite sample performance of these heuristics is explored in the
simulations in Section 7.
7. Simulation study

The results stated in the previous section suggest that post-double
selection type series estimation should exhibit good inference properties for
additively separable conditional expectation models when the sample size n is
large. The following simulation study is conducted in order to illustrate the
implementation and performance of the outlined procedure. Results from
several other candidate estimators are also calculated to provide a
comparison between the post-double method and other methods. Estimation
and inference for two functionals of the conditional expectation function are
considered. Two simulation designs are considered. In one design, the high
dimensional component over which model selection is performed is a large
series expansion in four variables. In the other design, the high dimensional
component is a linear function of a large number of different covariates.

7.1. Low Dimensional Additively Separable Design. Consider the
following model of continuous variables x, z of the form:

$$E[y \mid x, z] = g(x) + h(z)$$

where in this simulation, the true function of interest, g(x), and the
conditioning function h(z) are given by:

$$g(x) = \mathrm{logistic}(x) - \tfrac{1}{2}$$

( dim(2:) 

Zj 

j=i 

where $\mathrm{logistic}(x) = e^x/(1 + e^x)$ and the $\tfrac{1}{2}$ term in the
expression for g is used to ensure identifiability via g(0) = 0. Ex post, the
function is simple; however, for the sake of the simulation, knowledge of the
logistic form is assumed unknown. Importantly, the logistic function will not
belong exactly to the span of any finite series expansion used in the
simulation below. The second function h is similar, being defined by a
combination of a logistic function of a linear combination of the z variables.
The logistic part can potentially require many interaction terms unknown in
advance to produce an accurate model. The component functions g and h will be
used throughout the simulation. The remaining parameters, e.g. those dictating
the data generating processes for (y, x, z), will be changed across simulations
to give an illustration of performance across different settings.

The objective is to estimate a population average derivative and a
function evaluation, given by

(i) $\theta_1 = a_1(g) = \int_{\mathrm{supp}(x)} \frac{\partial g}{\partial x}(x)\,dF(x)$

(ii) $\theta_2 = a_2(g) = g(\mathrm{quantile}(x, .75)) - g(\mathrm{quantile}(x, .25))$.

$\theta_1$ and $\theta_2$ and estimates of standard errors are obtained using the
post-double methodology outlined in the paper with approximating dictionaries
specified below. The plug-in estimate of the average derivative, $\hat\theta_1$, is
integrated against the empirical distribution of x.
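
Given fitted series coefficients, both plug-in functionals are direct to
compute. The sketch below is illustrative only: it assumes the fitted
function is expressed in the probabilists' Hermite basis provided by
`numpy.polynomial.hermite_e` (the paper does not state which normalization
of the Hermite polynomials is used), and the function name
`plug_in_functionals` is hypothetical.

```python
import numpy as np
from numpy.polynomial import hermite_e as H


def plug_in_functionals(x, coef):
    """Plug-in estimates of the average derivative and the interquartile
    difference for g_hat(x) = sum_k coef[k] * He_k(x)."""
    dcoef = H.hermeder(coef)  # coefficients of the derivative of g_hat
    # average derivative, integrated against the empirical distribution of x
    theta1 = H.hermeval(x, dcoef).mean()
    q25, q75 = np.quantile(x, [0.25, 0.75])
    theta2 = H.hermeval(q75, coef) - H.hermeval(q25, coef)
    return theta1, theta2
```

For a linear fit the first functional reduces to the slope, which gives a
quick sanity check on the implementation.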

The covariates and outcome are drawn as follows. The distribution of the
conditioning set z is set at $z \sim N(0, \Sigma_{\dim(z)})$, where
$\Sigma_{\dim(z)}$ is the Toeplitz matrix of size dim(z) with decay base of 1/2;
therefore, the correlations between components are
$\mathrm{corr}(z_j, z_k) = (1/2)^{|j-k|}$. The variable of interest,
x, is determined by $x = h(z) + v$ with $v \sim N(0,1) \cdot \sigma_v$ independent of z.
The structural errors $\epsilon = y - E[y|x, z]$ are drawn $N(0,1) \cdot \sigma_\epsilon$
independent of x. Several simulations of this model are conducted by varying the
sample size n, the dependence between x and the remaining regressors z, as well
as the size of the residual errors $\epsilon$. The sample size is set to
$n \in \{500, 1000\}$. The dependence between x and the remaining covariates
is dictated by $\sigma_v$. To capture high and low dependence between the
covariates, the values $\sigma_v \in \{1, 2\}$ are used. Finally, the variability
of the structural shocks is set to $\sigma_\epsilon \in \{1, 2\}$.
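
A draw from a design of this type can be sketched as follows. The exact form
of h(z) is not reproduced here, so the code assumes, purely for
illustration, a logistic function of the sum of the components of z; the
decay base of 1/2, the default dim(z) = 4, and the function names are
likewise assumptions of this sketch rather than a statement of the design
used to produce the reported tables.

```python
import numpy as np


def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))


def simulate_design(n, dim_z=4, sigma_v=1.0, sigma_e=1.0, seed=0):
    """Draw (y, x, z) from an additively separable design."""
    rng = np.random.default_rng(seed)
    idx = np.arange(dim_z)
    # Toeplitz covariance with corr(z_j, z_k) = (1/2)^|j-k| (assumed base)
    Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    z = rng.multivariate_normal(np.zeros(dim_z), Sigma, size=n)
    # ASSUMPTION: illustrative h, not the paper's exact conditioning function
    h = logistic(z.sum(axis=1)) - 0.5
    x = h + sigma_v * rng.standard_normal(n)
    g = logistic(x) - 0.5  # true function of interest, g(0) = 0
    y = g + h + sigma_e * rng.standard_normal(n)
    return y, x, z
```

Varying `n`, `sigma_v` and `sigma_e` over the grids listed above gives the
eight settings of the low dimensional design.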

Estimation is based on a series expansion using Hermite polynomials. K
is set to $K = \mathrm{floor}(n^{1/3})$. Therefore, g(x) is approximated using a
K-order polynomial. The series expansion q(z) for the function h(z) spans all
polynomials in z of order $\le K$. Specifically, using interactions (products) of
univariate m-order Hermite polynomials, $H_m$, on single components of z, the
elements of q(z) consist of terms $\prod_j H_{m_j}(z_j)$ with $\sum_j m_j \le K$.
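
A dictionary of this kind can be assembled from univariate Hermite
polynomials. The following sketch is illustrative: it uses the
probabilists' Hermite polynomials from `numpy.polynomial.hermite_e` (an
assumption, since the normalization is not specified), keeps the constant
term, and the function name `hermite_tensor_dictionary` is hypothetical.

```python
from itertools import product

import numpy as np
from numpy.polynomial import hermite_e as H


def hermite_tensor_dictionary(z, K):
    """Products of univariate Hermite polynomials with total order <= K.

    Returns the (n, number_of_terms) dictionary and the list of
    multi-indices (m_1, ..., m_d) for its columns.
    """
    n, d = z.shape
    # V[j][:, m] holds He_m(z_j) for m = 0, ..., K
    V = [H.hermevander(z[:, j], K) for j in range(d)]
    cols, degrees = [], []
    for m in product(range(K + 1), repeat=d):
        if sum(m) <= K:
            col = np.ones(n)
            for j, mj in enumerate(m):
                col = col * V[j][:, mj]
            cols.append(col)
            degrees.append(m)
    return np.column_stack(cols), degrees
```

With d components and total order at most K the dictionary has
(K + d choose d) columns, so its size grows quickly with K even when d is
as small as four.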

In addition to the standard post-double Lasso based model, several
alternative estimates are calculated for comparison. We give three ad hoc
estimators which seem to be sensible data-driven ways to choose the
number of series terms (see the description and discussion of these in Section 6).
The first is the Post-Double selection where $K = \hat K$ is chosen from a set of
values $\mathrm{floor}(\tfrac{1}{2}n^{1/3}) \le K \le \mathrm{floor}(2n^{1/3})$. The
choice is made by the following procedure. First, run post-double selection for
each choice of K. This produces a set of candidate models
$\{(p^{\underline{K}}, \tilde q^{\underline{K}}), \ldots, (p^{\bar{K}}, \tilde q^{\bar{K}})\}$
which can be compared with each other. In this simulation, the preferred method
of comparison is BIC since it is simple computationally relative, for example,
to cross-validation. Then the final value $\hat K$ is chosen to be
$\hat K = \hat K_{BIC} + 1$. This estimator is
referred to as the Post-Double Set estimator in the simulation results tables.
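
The comparison step can be sketched as follows. This is a minimal
illustration under assumptions: the candidate design matrices (first
dictionary together with the selected second dictionary, one per K) are
taken as given, a Gaussian-likelihood form of BIC is used, and the function
names are hypothetical.

```python
import numpy as np


def bic(y, X):
    """Gaussian BIC of the least squares fit of y on X."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + k * np.log(n)


def choose_K_bic(y, designs):
    """designs maps each candidate K to its post-selection design matrix.

    Returns the BIC-minimizing K plus one, i.e. with one step of
    over-smoothing.
    """
    K_bic = min(designs, key=lambda K: bic(y, designs[K]))
    return K_bic + 1
```

The returned value can exceed the largest candidate by one, so in practice
the candidate set would need to include that value or the final fit would be
recomputed for it.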

In addition, the second suggestion from Section 6, which is to augment
the first dictionary in the first stage model selection step, is considered. For
given p(x), the extended first stage dictionary $p_{FS}$ is constructed by

$$p_{FS}(x) = p(x) \cup \{p_j(x) + p_{j'}(x) : j < j'\} \cup \{p_j(x) - p_{j'}(x) : j < j'\}$$

which is used in place of p(x) in the first stage model selection step. This
gives for fixed K an estimate which is called the Post-Double Ext estimate
in the simulation results tables.
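
The construction of the extended dictionary is mechanical and can be written
in a few lines. The sketch below is illustrative, and the function name
`extend_first_stage` is hypothetical.

```python
from itertools import combinations

import numpy as np


def extend_first_stage(P):
    """Append pairwise sums and differences of the columns of P."""
    K = P.shape[1]
    cols = [P]
    for j, jp in combinations(range(K), 2):  # all pairs with j < j'
        cols.append((P[:, j] + P[:, jp])[:, None])
        cols.append((P[:, j] - P[:, jp])[:, None])
    return np.hstack(cols)
```

A dictionary with K columns is extended to K + K(K - 1) = K^2 columns, so
the number of first stage selection problems grows quadratically in K.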
Finally, we consider a hybrid of the Post-Double Set estimator and the
Post-Double Ext estimator where a model is selected using the Post-Double
Ext method for each $K \in \{\underline{K}, \ldots, \bar{K}\}$. These models are
compared and the choice $\hat K$ is made according to $\hat K = \hat K_{BIC} + 1$.
This estimator is called the Post-Double Set+Ext estimator in the simulation
results tables.

In addition to Post-Double-based estimators, two additional series
estimators which are designed to approximate the function g(x) + h(z) are
compared. The first is based on a series approximation which assumes that
h(z) is additively separable so that $h(z) = \sum_j h_j(z_j)$. The
approximating series for each term $h_j(z_j)$ consists of Hermite polynomials in
$z_j$ of order $\le K$. The second series estimator uses identical series as the
post-double selection model, but performs no model selection; that is, it
proceeds under the same mechanical procedure as would be used for a standard
series estimator and uses the Moore-Penrose pseudo-inverse if it needs to invert
a singular matrix. Next, a single step selection estimator is provided. The
single step estimator is computed by performing a first stage Lasso on the union
of the two dictionaries, then re-estimating coefficients of the remaining
dictionary terms in the second stage. Finally, an infeasible estimator is
provided, where estimation proceeds as standard series estimation given the
dictionary p and as if $h(z_i)$ were known for each i.

Results for estimating $\theta_1$ and $\theta_2$ are based on 500 simulations for each
setting described earlier. For each estimator, the median bias, median absolute
deviation, and rejection probability for a 5-percent level test of
$H_{01}: \theta_1 = \theta_{01}$ or $H_{02}: \theta_2 = \theta_{02}$ are presented.
Results for estimating $\theta_1$ are presented in
Table 5.4. In each design, estimates of $\theta$ based on post-double selection
exhibit small median absolute deviation relative to the competing estimates.
With the exception of the infeasible estimates, the post-double selection
estimates are also the only estimates which exhibit reasonable rejection
frequencies consistently across all settings. Results for estimation of
$\theta_2$ are reported in Table 5.5. The results are qualitatively similar to
those for $\theta_1$. The only reasonable rejection frequencies are obtained with
the post-double selection. A small amount of size distortion can be seen as
rejection frequencies are closer to 10-percent in most simulations. The
distortion in the post-double estimator matches that in the infeasible
estimator, suggesting that it is driven by bias in approximating a nonlinear
function with a small basis rather than being a consequence of model selection.

7.2. High Dimensional Additively Separable Design. In this design,
a high dimensional setting is considered:

$$E[y \mid x, z] = g(x) + h(z).$$

For the functions

$$g(x) = \mathrm{logistic}(x) - \tfrac{1}{2}$$



In this model, the dimension of z is large compared to the sample size
and is set to dim(z) = floor(2n) for each scenario. The variables z are
drawn as above from a Gaussian with $z \sim N(0, \Sigma_{\dim(z)})$ so that, again,
the correlation structure is $\mathrm{corr}(z_j, z_k) = (1/2)^{|j-k|}$. The target
function g(x) remains the same as before. The dependence between x and z is
defined with $x = h(z) + v$. The specifications for $v$, $\sigma_v$, $\epsilon$, and
$\sigma_\epsilon$ are the same as
they were in the low dimensional example. Estimation is again based on a
series expansion in terms of Hermite polynomials in x. K is taken to be
$K = \mathrm{floor}(n^{1/3})$. The second dictionary comprises the variables $z_j$ themselves.
The simulation evaluates the performance of the estimator when the goal
is to control for many distinct possible sources of confounding. Results are
recorded in Table 5.6 for sample size n = 500 and Table 5.7 for n = 1000. In
this simulation, the single selection and infeasible estimators are defined as
they were in the low dimensional simulation. The series estimator (Series I)
uses the first floor(4n/5) covariates as the set of controls in estimation. The
second series estimator (Series II) randomly selects floor(4n/5) of the
covariates to use in estimation. The results are qualitatively similar to those
for the first design. Relative to other methods, the post-double selection
method exhibits substantially lower bias. The rejection frequencies obtained
with the post-double selection are considerably closer to the target 5% level
than with other methods.

8. Empirical Application: Estimating the Effect of Export on HIV

The results in the preceding sections show how variable selection methods
can be used to estimate additively separable models in which the component
of interest can be considered effectively randomly assigned conditional on
observables. This section illustrates use of the results by reexamining [30],
which studies the impact of trade on HIV incidence rates in a sample
of African countries. The original results are briefly reviewed, after which
estimates using the methods developed in this paper are calculated and
discussed.

The arguments presented in [30] propose two causal mechanisms relating
trade to HIV incidence rates in African countries. The first is simply that
increased income provided by more exports leads to an increase in risky
behavior. The second mechanism is that increased trade leads to an increase
in trucking and movement of people who engage in risky behavior. The basic
problem in estimating the causal impact of trade on HIV is that country-level
export rates are not randomly assigned. It is likely that there are
factors like government-specific policies that are associated with both trade
rates and HIV rates.

The baseline model estimated in [30] is for country-level incidence rates
running from 1985 to 2007, which conditions on country level fixed effects,
time level effects as well as lagged HIV prevalence. In the present analysis,
the argument given in [30], that the export rates defined above may be taken
as exogenous relative to HIV incidence once observables and government or
policy variables are controlled for, is taken for granted. While [30] controls
for these possible confounds by including country-specific and year-specific
fixed effects and lagged prevalence, this paper allows a much richer set of
variables to be used as controls. All of the new control variables are described
below and can be conveniently constructed from the original dataset used by
[30]. An additional benefit is that selection of a conditioning set offers a very
attractive complementary analysis to the standard practice of robustness
checks. Robustness checks typically estimate several perturbations of the
baseline model. If the values of causal estimates are insensitive to these
perturbations, then this is considered evidence in favor of the researcher's
original conclusions. While robustness checks are useful, it is very difficult
to come up with a principled means for choosing a correct set of perturbed
models to use. The methods here give one such formalized procedure and
serve to complement the robustness checks performed in [30].

This paper considers a model specified by:

$$y_{it} = g(x_{it}) + \alpha_i + \gamma_t + w_{it}'\delta + z_{it}'\theta + \epsilon_{it}$$


where i indexes country, t indexes time, $\alpha_i$ are country-specific effects that
control for any time-invariant country-specific characteristics, $\gamma_t$ are
time-specific effects that control flexibly for any aggregate trends, and
$w_{it}$ is the lagged HIV prevalence rate. The outcome $y_{it}$ is a log measure of
HIV incidence calculated in two different ways. The first measure, named
UNAIDS-based incidence, is based on data from UNAIDS, the United Nations
organization responsible for reporting on the global HIV epidemic. The second,
named Death-based incidence, extrapolates backwards from mortality rates
from the Demographic Yearbook Historical Supplement. The $x_{it}$ are logs of
three different export measures: total export value reported by the World
Development Indicators; total export value reported by NBER-United
Nations Trade Data; and total export volume reported by NBER-United
Nations Trade Data. The measures are labeled Log Value (WDI), Log Value
The arguments in this paper, in conjunction with those given for selection of controls
in high dimensional panel models in [7], could be used to justify allowing dependence
within countries over time. The standard errors in the current analysis are clustered by
country whereas [30] assumes an AR(1) structure and a linear model and proceeds with
Prais-Winsten regression.

(NBER), and Log Volume (NBER). For details, please refer to [30].
This paper abstracts away from important issues arising from measurement 
error in both incidence rates and export rates in order to provide a clearer 
illustration of the model selection methods. 

The set of controls represented by $z_{it}$ is richer than the controls
considered in the original paper. $z_{it}$ is constructed from
interactions of an aggregate linear time trend, $t$, an aggregate quadratic time
trend, $t^2$, aggregate periodic time trends with periods 4 and 8 years,
$\sin(2\pi t/4)$, $\cos(2\pi t/4)$, $\sin(2\pi t/8)$, $\cos(2\pi t/8)$, log-population,
indicators for larger regions in Africa, initial values of HIV prevalence, GDP,
log-population, and export levels. The final set of controls consists of 78
variables and can be obtained upon request. The sample consists of between
720-747 observations for the UNAIDS-based incidence measure and 161-166
for the Death-based incidence measure, with discrepancies arising from
missing observations. Therefore, a conditioning set of 78 variables, though
more robust than the baseline, is large enough to potentially cause statistical
problems. Therefore, this analysis will apply the model selection procedure
discussed in the earlier sections. The estimates of $g(x_{it})$ are based on a
series expansion for g in terms of Hermite polynomials. The order of the
approximating polynomial is chosen hy K = floorwhere n is the total 
number of observations (not observational units). 

Estimates for the sample average derivative, $\frac{1}{n}\sum g'(x_{it})$, based on the 
UNAIDS measure are presented in Table 5 and estimates based on Death-based 
incidence are presented in Table 6. In each table, the first panel uses 
the Log Value (WDI) measure, the second panel uses the Log Value (NBER) 
measure, and the third panel uses the Log Volume (NBER) measure. The 
estimate is interpreted as the average percent increase in HIV incidence 
resulting from an exogenously given percent increase in export over the 
sample export values. The tables present estimates and 95% confidence 
intervals, as well as lists of selected variables for each analysis. In addition 
to the post-model selection estimates, the tables present a baseline estimate, 
corresponding to the baseline model in [30] (excluding $z_{it}$, yielding 
$y_{it} = g(x_{it}) + \alpha_i + \gamma_t + w_{it}'\delta + \epsilon_{it}$), and an 
estimate from a model using the full set of controls without model selection. 
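
To make the target quantity concrete, the following sketch evaluates the sample average derivative of a fitted Hermite series. It assumes the probabilists' Hermite polynomials He_k, for which He_k'(x) = k He_{k-1}(x); the text does not say which normalization is used, and the function names and coefficient vector are hypothetical.

```python
def hermite_values(x, K):
    """Return [He_0(x), ..., He_K(x)] via He_{k+1} = x*He_k - k*He_{k-1}."""
    vals = [1.0, x]
    for k in range(1, K):
        vals.append(x * vals[k] - k * vals[k - 1])
    return vals[: K + 1]

def average_derivative(beta, xs):
    """Sample average of g'(x_i) for g(x) = sum_k beta[k] * He_k(x).

    Uses the identity He_k'(x) = k * He_{k-1}(x) term by term.
    """
    K = len(beta) - 1
    total = 0.0
    for x in xs:
        he = hermite_values(x, K)
        total += sum(k * beta[k] * he[k - 1] for k in range(1, K + 1))
    return total / len(xs)
```

For example, with beta = [0, 0, 1] the series is g(x) = x^2 - 1, so g'(x) = 2x and the average derivative over xs = [1, 2, 3] is 4.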

The estimates presented in Table 5 are not significantly different from 0 in 
any of the circumstances. This is partly consistent with the findings in [30], 
in that the significance does not withstand the robustness checks. The 
estimates here are, in principle, more variable relative to [30] since this paper 
considers nonparametric specifications for g whereas [30] considers a linear 
specification. The estimates in Table 6 are considerably more interesting. In 
particular, the interpretation of the estimates changes depending on which 
set of controls is used. Using the first measure of export, Log Value (WDI), 
the estimated average derivative is significantly different from zero under the 
baseline model. However, the estimate fails to maintain significant 
separation from zero when the full set of controls is used. This can be because the 
true causal effect based on the richer conditioning set is null, or because the 
inclusion of many control variables presents a serious limitation in terms of 
statistical precision. In this situation, using a selected set of controls yields 
a statistically significant, positive estimate. In addition, the corresponding 
confidence interval is narrower. Therefore, for these particular measures 
of incidence and export, the model selection method supports the 
conclusion of the baseline estimate of a positive average effect. Using the other 
two measures of export, the interpretations of the estimates, presented in 
Panels 2-3, do not change between the three estimation strategies. 

9. Conclusion 

This paper considers the problem of choosing a parsimonious conditioning 
set in the context of nonparametric regression. Convergence rates and 
inference results are provided for series estimators of additively separable 
models with a high dimensional component. Lasso continues to have good 
selection properties in this context and can be used in post-model selection 
inference when two model selection steps are performed. 

Appendix A. Lasso Penalty Loadings Implementation 

For completeness, this section presents implementation details for Cluster- 
Lasso. The details are the same as those given in [3]. Following a mechanical 
description of penalty loadings choice, conditions are presented which are 
sufficient for establishing the asymptotic validity of the proposed algorithm 
in this appendix. 

Feasible options for setting the penalty level and the loadings for j = 
1 ,..., L are 

Initial: jE”.i 

Refined; E”.i 

«(-.)=£'? 

Penalty: $\lambda_{FS} = 2c\sqrt{n}\,\Phi^{-1}(1 - \gamma/(2KL))$ 

$\lambda_{RF} = 2c\sqrt{n}\,\Phi^{-1}(1 - \gamma/(2L))$ 

where c > 1 is a constant, $\gamma \in (0,1)$, $\Phi$ is the cumulative distribution 
function for the standard Gaussian distribution, and $\hat\epsilon_i$ is an estimate of 
$\epsilon_i := y_i - E[y_i|z_i]$. Let $N_{loadings} \ge 1$ denote a bounded number of iterations. 
This paper uses c = 1.1, $\gamma = 0.1/\log(\max\{KL, n\})$, and $N_{loadings} = 15$ in 
simulation examples. In what follows, Lasso/Post-Lasso estimator indicates 
that the practitioner can apply either the Lasso or Post-Lasso estimator. 
The simulations and empirical example in this paper use Post-Lasso. 
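
The penalty levels can be computed directly from the constants quoted above. The sketch below is a minimal illustration using only the Python standard library. Reading the level with argument 1 - γ/(2KL) as the first-stage penalty and the one with 1 - γ/(2L) as the reduced-form penalty is an assumption of this example, as are the function and variable names.

```python
import math
from statistics import NormalDist

def penalty_levels(n, K, L, c=1.1):
    """Return (lam_fs, lam_rf) for sample size n, K series terms, L controls.

    gamma follows the simulation choice 0.1 / log(max{KL, n}).
    """
    gamma = 0.1 / math.log(max(K * L, n))
    inv_cdf = NormalDist().inv_cdf  # standard Gaussian quantile function
    lam_fs = 2 * c * math.sqrt(n) * inv_cdf(1 - gamma / (2 * K * L))
    lam_rf = 2 * c * math.sqrt(n) * inv_cdf(1 - gamma / (2 * L))
    return lam_fs, lam_rf

lam_fs, lam_rf = penalty_levels(n=500, K=4, L=100)
```

Because the first-stage level controls K·L scores at once rather than L, it is the larger of the two whenever K > 1.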








Algorithm of Cluster-Lasso penalty loadings 

(1) Specify penalty loadings according to the initial option above. Use these 
penalty loadings in computing the Lasso/Post-Lasso estimators defined in 
Section 3.3. Then compute residuals $\hat W_{ik} = p_k(x_i) - q(z_i)'\hat\Gamma_k$, 
$\hat\epsilon_i = y_i - q(z_i)'\hat\pi$ for i = 1,..., n and k = 1,..., K. 

(2) If $N_{loadings} > 1$, update the penalty loadings according to the refined 
option above and update the Lasso/Post-Lasso estimator. Then compute a 
new set of residuals using the updated Lasso/Post-Lasso coefficients, 
$\hat W_{ik} = p_k(x_i) - q(z_i)'\hat\Gamma_k$, $\hat\epsilon_i = y_i - q(z_i)'\hat\pi$ 
for i = 1,..., n and k = 1,..., K. 

(3) If $N_{loadings} > 2$, repeat step (2) $N_{loadings} - 2$ times. □ 
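
The control flow of the three steps can be sketched independently of any particular Lasso solver. In the sketch below, `fit` and `refine` are hypothetical callables supplied by the user: `fit` stands in for a Lasso/Post-Lasso fit that returns residuals for given loadings, and `refine` for the refined loadings rule. Neither is implemented here, and the toy functions at the bottom are placeholders, not estimators.

```python
def iterate_loadings(initial_loadings, fit, refine, n_loadings=15):
    """Run the loadings iteration and return the final (loadings, residuals).

    fit(loadings) -> residuals      hypothetical Lasso/Post-Lasso fit
    refine(residuals) -> loadings   hypothetical refined-loadings rule
    The estimator is fitted exactly n_loadings times in total.
    """
    if n_loadings < 1:
        raise ValueError("n_loadings must be at least 1")
    loadings = list(initial_loadings)
    residuals = fit(loadings)                 # step (1): initial loadings
    for _ in range(n_loadings - 1):           # step (2), then step (3) repeats
        loadings = refine(residuals)
        residuals = fit(loadings)
    return loadings, residuals

# Toy stand-ins (assumptions for illustration only): a contraction whose
# fixed point is a loading of 2.0.
calls = []
def toy_fit(loadings):
    calls.append(1)
    return [0.5 * l for l in loadings]
def toy_refine(residuals):
    return [1.0 + r for r in residuals]

final_loadings, final_residuals = iterate_loadings([1.0], toy_fit, toy_refine)
```

With the default of 15 passes the toy loading settles near its fixed point, which mirrors why a small bounded number of iterations is used in practice.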

The results of Proposition 1 and Theorem 1 in this paper rely on asymptotic 
validity of penalty loadings, in the sense that, with probability 1 - o(1), each 
feasible loading lies between $\ell$ times and $u$ times the corresponding ideal 
loading, for every j in the reduced form and for every j, k in the first stage, 
with $\ell \to 1$ and $u \le C < \infty$. [3] list several primitive assumptions which 
imply asymptotic validity when K = 1. The modifications required on their 
conditions are not substantive and therefore this paper assumes asymptotic 
validity of the penalty loadings as a high level condition. 

Appendix B. Proofs of the Main Results 

B.1. Additional notation used in proofs. In the course of the proofs, 
the following notation will be used. First, we use the standard stacking 
convention for random variables indexed by i = 1,..., n into a column vector 
of size n × 1 so that $y_i$ are stacked in $Y$, $g_i = g(x_i)$ are stacked in $G$, 
$h_i = h(z_i)$ are stacked in $H$, and so forth. Let $\hat I$ be the full set of series 
terms chosen in the final estimation coming from the dictionary q. $\hat I$ is 
given by $\hat I = I_0 \cup \hat I_{RF} \cup \hat I_1 \cup \dots \cup \hat I_K$. Define for any subset 
$J \subset [p]$, $Q[J]$ to be the corresponding set of selected dictionary elements. 
Let $b$ be the least squares coefficient for the regression of any vector $U$ on 
$Q[J]$ so that $b = b(U;J) = (Q[J]'Q[J])^{-1}Q[J]'U$. Let 
$\mathcal{P}_{\hat I} = Q[\hat I](Q[\hat I]'Q[\hat I])^{-}Q[\hat I]'$ be the sample projection onto the 
space spanned by $Q[\hat I]$. Let $\mathcal{M}_{\hat I} = I_n - \mathcal{P}_{\hat I}$ be projection onto 
the corresponding orthogonal subspace. Let $\hat\Omega = P'\mathcal{M}_{\hat I}P/n$. 
Let $\Omega = E[(p(x) - Tp(z))(p(x) - Tp(z))']$. Decompose $P = m + W$ where 
the ith element of $m$ is defined by $m_i = E[p(x_i)|z_i]$. Let $\bar\Omega = W'W/n$. 
Let $\|\cdot\|$ denote Euclidean norm when applied to a vector and the matrix 
norm $\|A\| = \sqrt{\mathrm{tr}\,A'A}$ when applied to a square matrix. Let $\|\cdot\|_1$ and 
$\|\cdot\|_\infty$ denote $L_1$ and $L_\infty$ norms. Let 
$\xi_{FS} = \max_k \sqrt{(m_k - Q\Gamma_k)'(m_k - Q\Gamma_k)/n}$ and 
$\xi_{RF} = \sqrt{(G + H - Q\pi)'(G + H - Q\pi)/n}$ be the approximation error 
in the first stage and reduced form. 

B.2. Proof of Proposition. Begin by establishing the claim 
$\|\hat\Omega - \Omega\| \to_p 0$ by bounding each of the following terms separately: 
$\|\hat\Omega - \Omega\| = \|\hat\Omega - \bar\Omega + \bar\Omega - \Omega\| \le \|\hat\Omega - \bar\Omega\| + \|\bar\Omega - \Omega\|$. 
The argument in Theorem 1 of Newey (1997), along with the fact that 
$\sup_{x \in \mathcal{X}} \|p(x) - Tp(x)\| \lesssim \sup_{x \in \mathcal{X}} \|p(x)\|$, 
gives the bound $\|\bar\Omega - \Omega\| = O_p(\zeta_0(K)K^{1/2}/\sqrt{n})$. Next bound 
$\|\hat\Omega - \bar\Omega\|$. Using the decomposition $P = m + W$, write 
$\hat\Omega = (m + W)'\mathcal{M}_{\hat I}(m + W)/n = W'W/n - W'(I_n - \mathcal{M}_{\hat I})W/n + m'\mathcal{M}_{\hat I}m/n + 2m'\mathcal{M}_{\hat I}W/n$. 
By triangle inequality, 
$\|\hat\Omega - \bar\Omega\| \le \|W'\mathcal{P}_{\hat I}W/n\| + \|m'\mathcal{M}_{\hat I}m/n\| + \|2m'\mathcal{M}_{\hat I}W/n\|$. 
Bounds for each of the three previous terms are established in Lemma 4 giving 
$\|\hat\Omega - \bar\Omega\| = O_p(K\phi/n)$. 

Since $\Omega$ has minimal eigenvalues bounded from below by assumption, 
it follows that $\hat\Omega$ is invertible with probability approaching 1 (by $\bar\Omega$ 
being invertible with probability approaching 1 and by Lemma 4). Consider 
the event $\mathcal{A} = \{\lambda_{\min}(\hat\Omega) > \lambda_{\min}(\Omega)/2\}$. By reasoning identical to 
that given in [29], the following "variance" and "bias" terms have bounds 
$1_{\mathcal{A}}\|\bar\Omega^{-1}W'\epsilon/n\| = O_p(\sqrt{K}/\sqrt{n})$ and 
$1_{\mathcal{A}}\|\bar\Omega^{-1}W'(G - P\beta)/n\| = O_p(K^{-\alpha})$. 

To proceed, it is required to obtain analogous bounds for 1 P'/n\\ 

and P'^y{G — Pjl)/n\\. Considering the ’’variance” term hrst, note 

that 

\^^-^P'JIj€ln - Cl-^W'e/n\\ ^ 1^11(0“^ - n-^)W'e/n\\ 

+l^\\n-^{W' - P'JlY)e/n\\. 

Consider the hrst term above. 1^\\{II~^ — il“^)lT'e/n|| ^ 1 yA ma y(fl~^ — 
fi-i)||lT'e/n|| = Op(/^)Op(Co(i^)\/K/V^) = Op{^/^). For 

the second term, ||fl“^(lT' — P'./#^)e/n|| ^ l^fAmax(f2~^)||(lT' — 

P'./#-^)e/n|| = 1^0p{l)\\m'^j€/n\\ = Op{\/K(i)/y/n) by Lemma 4. 

Turning to the ’’bias” term, 

lj^\\h-^P'J^j{G-PI3)/n\\ =\^\{G-P^)'JIjPh-^P'JIj{G-P^)lnfl’^ 

= Op(l)[(G-P/3)'(G-P/3)/n]'/2 
= Op{K-^) 

by assumption on {G — P/3) and idempotency of P '= 

J^YP{P'J^Y^rP)~^-^r 

The last intermediate ingredient before putting together the proof of the 
proposition is a bound on l^||U“^P'./#^P/n|| = Op{l)\\P'^jG 2 /n\\ = 
Op{Co{K)y/K(j)/n + y/KcjP/ns -|- VKK~°‘y/^^)/y/n + Op{y/K^K~^ -|- 
Ap/pW ny/^)Jn)/y/n by triangle inequality and Lemma 4(iv) and 4(v). 
This reduces to Op{yjK/n P iC““). 

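To show the proposition, bound the difference $\hat\beta - \beta$. Note that 
$1_{\mathcal{A}}(\hat\beta - \beta) = 1_{\mathcal{A}}\hat\Omega^{-1}P'\mathcal{M}_{\hat I}\epsilon/n + 1_{\mathcal{A}}\hat\Omega^{-1}P'\mathcal{M}_{\hat I}(G - P\beta)/n + 1_{\mathcal{A}}\hat\Omega^{-1}P'\mathcal{M}_{\hat I}H/n$. 
Triangle inequality and bounds described above give 
$1_{\mathcal{A}}\|\hat\beta - \beta\| \le 1_{\mathcal{A}}\|\hat\Omega^{-1}P'\mathcal{M}_{\hat I}\epsilon/n\| + 1_{\mathcal{A}}\|\hat\Omega^{-1}P'\mathcal{M}_{\hat I}(G - P\beta)/n\| + 1_{\mathcal{A}}\|\hat\Omega^{-1}P'\mathcal{M}_{\hat I}H/n\| = O_p(\sqrt{K/n} + K^{-\alpha})$. 
The statement of the proposition follows from the 
bound on $\hat\beta - \beta$ using the arguments in [29]. 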

B.3. Proof of Theorem. Let F = 1/“^/^ and g = p{xyj3 and decompose 
the quantity l^y/nF[ayg) — a{g)\ by 

^^VnF[a{g)-a{g)] = lnVnF[a{g)-a{g)+D{g)-D{g)+D{g)-D{g)+D{g)-D{g)]. 

By arguments given in the proof of Theorem 2 in Newey (1997), 

~ -^( 5 )]! ^ C^/nK~°‘. In addition, bounds on \'g — g\ci 
given by the proposition imply that \^/nF[a{g) — a{g) — D{g) + D{g)\ ^ 
CupV^lg- g\l = Op{CLipVnCd{K){y/K/^^ + + y/s/y/nf) 0. It 

remains to be shown that \^y/nF\D{g) — D{g)\ satisfies an appropriate 
central limit theorem. Note that DiTj) can be expanded 

D{g) = D{p{x)% = D{p{xyh-^P’j^fy) 

= D{p{xyn-^P'^f{G + H + e)) = D{p{x)yh-^P'^f{G + H + e) 

= A'Q-^P'^j-i^ + H + e) = A'Q-^P'^fG + A'^-^P'^fH + A'Q-^P'^je 
Using the above expansion and D{g) = D{p{xy 13) = A'j3 gives 
^F[D{g) - D{g)] = yJnPA'P'J^jG - /3] 

+ y/7iFA'[h-^P'j^jH] + y/7lFA'fl-^P'^p\ 

The terms y/nFA'[Q~^P'^jG —/3] and y/nFA'['D,~^P'^YH] will be shown 
negligible while the third term y/nPA'[Q~^P 'will be shown satisfying 
a central limit theorem. 

First, note the expressions || = Op(l), 1^\\FA'Q.~^/‘^\\ = 

Op(l) both hold by arguments in Newey (1997). Beginning with the first 
term, 

l^\y/^FA'[h-^P'^jG/n - /3]| 

= 1 j^\^FA'[{P'^YPl^)~^P'^ii^ - 

^ lj^\\FA'n-^P'^p/^\\\\G-Pp\\ 

^ l^\\FA'h~^/‘^\\y/nuisi^\g{xi) - g{xi)\ 

< l^WFA'n-^/^Mg - g\o = Op{l)Op{y/^K-^) = op(l) 

Next, consider y/nFA'Q,~^P'/n. By P = m + W, triangle inequality, 
Cauchy-Schwartz and Lemma 4, 

\FA'h-^P'^YH/^/^\ ^ \FA'?l-^m'^jG2/Vn\ + \\FA'h-^W'^fH/y/Ti\\ 

= Op(l)op(l) + Op(l)op(l) 
Next consider the last remaining term for which a central limit result will 
be shown. Note that using the bounds for $\|P'\mathcal{M}_{\hat I}\epsilon/\sqrt{n}\|$ and 
$\|m'\mathcal{M}_{\hat I}\epsilon/\sqrt{n}\|$ derived in Lemma 4, 

y/^FA'n-^P'J^je/n = ^/^FA'Q-^P'^pi/n + y^FA’i^-^ - 9.-^)P' 

= y/^FA'n-^P'^Y^/n + 0{\\FA'{?l-^ -Q.-^)\\\\P'Ji:Y€/y/^\\) 
= y/nF A'P' / n + op(l) 

= ^FA'n-^W'e/n + ^FA'n-^{W' - P'JijYjn + op(l) 

= V^FAl'L!-^iy'e/« + 0(||F2l'L?-^||||m'.^j€/V^||) + op(l) 

= ^/^FAl^-^W'e|n + op(l) 

Let Zin = FA'Wieijy/n. Then Yli^in = FA'V'ejy/n. For each n, Zin is 
i.i.d. with E[Zin\ = 0, X]* ElZf^ = 1. In addition, 

^ n5^\\F^\Wo{KfE[\\wifE[et\xi]]/n^6^ ^ CQoiKfKi/n ^ 0. 

By the Lindeberg-Feller Central Limit Theorem, $\sum_i Z_{in} \to_d N(0,1)$. 

Next consider the plug in variance estimate. First, bound ||A — ^||. In 
the case that a{g) is linear in g, then a{p'P) = A'/3 A = A. Therefore, 
it is sufficient to consider the case (ii) of Assumption 7, that a{g) is not 
linear in g. For p as in the statement of Assumption 7, Define the event 
8 = Sn = {\g - g\d < 1^/2}. In addition, let J = {D{pii;g),...,D{piK]g)y. 

Then for any j3 such that \p'(3 — g\< v 1‘i‘i it follows that |p'/3 — g\ < v and 

ls\a{p'l3) - a{g) -J'{(3- ^)|/||/3 - ^|| 

=ls\a{pl3 - a{g) - D{p'l3]g)FD{g-g)\/\\l3 - ^|| 

• |p'/3 - ?|2//3 - ^11 ^ isC ■ Cd{Kf\\P - ^11 ^ 0 


Therefore, A exists and equals J if = 1. 

Iflll - Af = 1^(1 - A)'{A -A) = l£\D{{A - A)'p-g) - D{{A - Ayp-,g)\ 

^ C • l^KA - AypU\g -g\d^C- ||A - A\\CdiK)\g - gU 

This gives l£-||A-A|| ^ C ■ Cd{K)\g - g\d = Op{Qd{Kf{-\/Kjy/nK~°^)) A 0. 

A consequence of the bound on ||A — A|| is that l£:||FA|| ^ l£:||F||||A — 
A|| -|- ||FA|| = Op(l). Similarly, l£:||FAD“^|| = Op(l). Next, define u = 
l£8l~^AE and u = l£8l~^AE. 

||u-u|| ^ i£\\EA'h-\n - h)\\ + i£\\F{A - Ay\\ 

^ 1^||f1'D-1||||L! - Q\\ + l£\F\\\A-A\\ 4 0 
Next, note that $u'\Sigma u = 1_{\mathcal{E}}$. In addition, $\Sigma \le C \cdot I$ in the positive definite 
sense by assumption. Therefore, 

ls\u'T,u — 1| = \uT,u — u'T,u\ ^ (u — u)'Tj{u — u) + \2{u — 

^ (7 • ||n — n||^ + 2{{u — n)'S(n — {u' 

^ Op(1) + Clltt — u|| A 0. 

Define S = WiW'ef/n, an infeasible sample analogue of S. By reasoning 
similar to that showing ||D — D|| A 0 it follows that ||I] — S|| A 0. Then this 
implies that ls\uJ^u — u'T;u\ = —S)tt| ^ ||{t|p||I] — S|| = Op(l)op(l) A 

0. 

Next, let Aij = g{xi) — g{xi) and A 2 i = h{zi) — h{zi). Then maxj^„ | Aj| 

Iff “ fflo = 0 ( 1 ) A 0 follows from the proposition above. Let iof = u'WiW,^ 
and Qf = u'WiW'u. Bound S to S by considering the quantity 

n n 

u'WiWie^iuln - Y u'WiWle^uIn 

i=l i=l 



n 

- 4)ln 

+ 

n 


i=l 



i=\ 


Both terms on the right hand side will be bounded. Consider the first term. 
Expanding (ef — ef) gives 


n 


n 

+ 

n 

Y^i^2i/n 

+ 

n 

Y^ AiiA2i/n 

1 = 1 


i=l 


i=l 


i=l 


£n\FVF - u'm\ = |n'(S - S)n| 


+2 

n 

YuifAuej/n 

+ 2 

n 

Yu}fA2iei/n 


i=l 


i=l 


V/ 13 
These five terms above are bounded in order of their appearance by 

n n 


^ max|Aii| = op(l)Op(l) 

^' i<n ^' 

i=l i=l 

n n 

< max|A2i| Y]a;f|ejl/n = op(l)Op(l) 
i=l i=l 

n n 

Y t<;fAiiA 2 i/n ^ max|Aii| max|A 2 i| 7 Jf/n = op{l)Op{l) 

^ i<n i<n 

i=l i=l 

n n 

Y ujfAu€i/n ^ max|Aii| y^wfleil/n = op(l)Op(l) 

i=l i=l 

n n 

Y,^i^2iei/n ^ max|A2i| y^w.^leil/n = op(l)Op(l) 
i^n ' 

2=1 2=1 


where the bound $\max_{i \le n}|\Delta_{1i}| = o_p(1)$ follows by the proposition and 
$\max_{i \le n}|\Delta_{2i}| = o_p(1)$ follows from Lemma 5 below. On the other hand, the 
argument that the remaining averages in the display are $O_p(1)$ is the same as in [29]. 

The second term is bounded by 

n 

Y HWiWl - WiW[)e\uln < max \e^ 

2 = 1 

n 

< max |?^|||u|p|| y^(WjW/ — lTiW/)/n|| = max |??|||{i||^|| ||n — 0|| 
i^n ^ i^n 

2=1 


n 

Y HWiW' - WiWDujn 
2 = 1 


^ (maxle^l+maxle^-e^l) ||u||2||||fi - f2|| 

\ 2^72 2^22 / 

= (Op(n2/9) + op(l)) Op(l)Op(iL0/v^) 
= Op(l) 


where the last bounds come from the rate condition in Assumption 9 and 
$\max_{i \le n}|\hat\epsilon_i^2 - \epsilon_i^2| = o_p(1)$ by $\max_{i \le n}(|\Delta_{1i}| + |\Delta_{2i}|) = o_p(1)$. 

This implies that £n\FyF ~ 1| —^0. With probability approaching 1, 
If = 1, this gives FVF —^ 1 which in turn implies that 

^ ^F{e- e)/{FVFf/‘^ A a(o, i). 

To provide a rate of convergence, ll^j ^ C ■ since 9 = do + 

/^/n)^/nF{6 — 6) = 6 + Op{V^^‘^j^/n). Cauchy-Schwartz inequal¬ 
ity implies that \p'P\d ^ C(i(-^)ll/3|| for any choice of /3. Then ||A|p = 
\D{p'A)\ ^ C • \p'A\d ^ C • Cd{K)\\A\\. This gives P|| ^ C • Q{K) and 

|y| ^C-||A||2^C-CdW. 
The proof of the second statement of the Theorem uses similar arguments 
as the proof of the first and follows from the proof of Theorem 3 in [29]. 

Appendix C. Lemmas 


The first lemma is a performance bound for Post-Lasso estimates. It 
is required for use in the next two lemmas. It is based on the results 
of [3]. Define the following four events which are useful for describing the 
regularization properties of the Lasso regressions. 

Afs = {X^^/n ^ c max Arf = {A^^/n ^ cmax 

j^L,k^K j^L 

Bfs = ^ k], Brf = ^ ^ 

Where i and u are the constants in Assumption 6 and Sj^RF,Skj are 
dehned as the scores of the quadratic components of the Lasso opti¬ 
mization problems in Section 3.3. Dehne the regularization event ^ = 
Afs n Arf n Bfs n Brf- In addition, dehne cq = {uc + l)/(^c — I). Let 
Kc = m.ms(zAc^r\T\i:ss6'{Q'Q/n)d/\\6r\\l where Ac,r = {d / 0 : ||(5r'=l|i ^ 
C||^tI|i}- This dehnes the restricted eigenvalue and is useful for Lasso 
bounds. For more details regarding the dehnition, see for example, m- 
Let y/^5'{Q'Q/n)6/\\^^^6r,,\\i. Dehne 

analogously using the reduced form 


Lemma 1. Under the conditions given in the text, the following inequalities 
hold. 


IggmaxW-J^fmk/Vnh ^ 1.* ( max(rt-|- 1/c 


k^K 




k<K 


UK 


+ ?,iFS 


CO 


lj^\\^YE[y\z\/y/n\\2 1/c 




UK: 


RF 

CO 


+ 


In addition, the regularization event satisfies P(i^) 


1 . 


Proof. That $A_{RF}$ holds with probability approaching 1 was established in 
[3]. The conditions listed in Assumption 6 allow use of the same argument 
to show that $A_{FS}$ holds with high probability by allowing the application 
of the moderate deviation results of [22]. In addition, $B_{FS}$, $B_{RF}$ hold by 
Assumption 6. 

Therefore, the regularization event holds with probability approaching 1, 
giving the last claim of the lemma. The first two claims follow immediately 
from the third statement of Lemma 7 in [3]. 

□ 

Lemma 2. maxfcs:A(l/«co) = Op{l) and max^^sA \{j : Pkj A 0}| = Op{s) 

Proof. For the hrst result, let a = minfej|T^j^| and b = max^j |'I'^^:®'|. 
Step I of the the proof of Theorem 1 in |3] shows that m.ax.k^K{l/Rg) ^ 
b{Rbco/a{Q'Q/'n))~^. As a simple consequence of Assumption 6, a and b 
are bounded from above and away from zero with probability approaching 
1. Then, also by assumption, $(\kappa_{bc_0/a}(Q'Q/n))^{-1} = O_p(1)$. This 
implies the first statement. For the second statement, let $\hat s_k$ be the 
number of incorrectly selected terms in the k-th first stage regression. Then 
By Lemma 10 of 13 ], Sk ^ sv3max(g) maxj | (2 co/Kco + ^conipS/X^/sf' 
for every integer q > 2sq)^i,y^{q) \'^jk\~‘^+ Qcqu^fS/X^/ s)‘^. 

The choice q = «:"2s(/?max(g) max^j jk\~‘^{‘^CQ/K^^ + Qc^nipS/Xy/s)‘^ yields 
^rasLxiq) = Op{l) by Assumption 5 and by using max^j = Op(l), 

maxj^fc = Op(l), maxfc 2co/«;^(, = Op(l) and Qco/n^ps/X^^y/s = 

op(l) it follows that max^^p'S^fc = Op{Ks). □ 

The following lemma bounds various quantities used in the proof above. 
The lemma provides analogous results to steps 4-6 of [9] in the proof of their 
Theorem 1 but accounts for an increasing number of series terms in the first 
dictionary. 

Lemma 3. First Stage and Reduced Form Performance Bounds 
(i) maxfc^gx ||^jmfc/Vra|| = Op{^f)/n) 

(a) W^jH/ySiW = Op{^/K^fJn + K~°') 

(in) TCicCKk^K ||rfc(/) - Ffcll = Op{^fcf/n) 

(iv) \\b{H-J) - 7 ?|| = Op{^/fJn + K-°^) 

(v) maxfc^x \\Q'Wk/y[n\\oo = Op{ \fHs), | |Q'e/\/ra||oo = Op{^/f/s). 

(vi) maxk^K\\b{Wk;T)\\i = Op{^/K^scp/n) 

Proof. Statement (i) follows from an application of Lemma 1: 

l^max ^ max m^/v^H 

k!^K ^ ki^K 

^ max(rt -h l/c)X^^y/s/uK^ + S^ps 
k^K 

^ l^max{u + l/c){C^/nlog{m.ax{KL,n))^/s/nK^^ + 1^3^P5 

= F 3Cps = Op{^/fJn) 

Where the last equality follows from Lemma 2 and the definition of 
fps = Op{\/^Jn) and 1^ A 1. Next, Consider statement (ii). 

^ ls\\.^j{E[G\z] + H)/^\\ + l^\\J^jE[G\z]/^\\ 

^ ^s\\^iRF{^[y\A)/'/nW + l3i\\.y^jE[G\z\/y/n\\ 

^ lag{u -h l/c)X^^y/Ws/nK^^ + l^^S^pp -h 1^\\f^yE[G\z\/ y/n\\ 

^ l^(u -|- \lc){C\J nlog(max(iLL, n))y/Ks/nn^^ 

+ + '^si\\-^tE[G\z]/V nW 

To control the approximation error for the reduced form, i.e. to bound $\xi_{RF}$, 
note that 

$g(x) + h(z) = q(z)'(\Gamma\beta + \eta) + (g(x) - p(x)'\beta) + (h(z) - q(z)'\eta)$ 

The approximation error is then given by 

$\xi_{RF}^2 = (G - P\beta + H - Q\eta)'(G - P\beta + H - Q\eta)/n$ 

$\le 2(G - P\beta)'(G - P\beta)/n + 2(H - Q\eta)'(H - Q\eta)/n$ 
$= O(K^{-2\alpha}) + O(Ks/n)$ 

Next consider \\^yE[G\z]/ y/n\\ and note that = ml3+E[G—mf3\z]. 

By statement (i) of this lemma, \\^Ymf3l/^/n\\ ^ max^ V^WWPiW — 

Op{^/4'/n)0{l). Next, \\^fE[G - mf3\z]/^/n\\ ^ \\^fE[G - PI3\z]/y/n\\ + 
\\^yE[PI 3 — ml3\z]/y/n\\. The first term \\^j-E[G — Pf3\z]/y/n\\ is 0{K~°‘) 
and the second term \\^YE[Pf3 — mf5\z]/^/n\\ vanishes identically. These 
put together establish that l^\\^Y^/^/n\\ ^ 1^0p(\/ Kcjy/n) + 0{K~°^). 

The result follows by noting that 1. 

Next consider statement (hi). Let T = IU supp(ri) U ... U supp(rx). 

max ||ffc(/) - Tfcll ^ Tascx.{J (pmin{\Tk\)\\fk(l) - TfcH} < max ||Q(ffc(/) - Tk)/Vn\\ 
k^K k ’ k^K 

^ ^ax{\\^fmk/^/n\\ + \\{mk - QTk)/Vn\\} = Op{\f^Jn). 

k^K 

Where the last bound follows from = Op{l) by Assumption 5 on 

the restricted eigenvalues and by T = Op{Ks) by the result of the lemma 
above. Statement (iv) follows from similar reasoning as for statement (iii). 
Statement (v): note that by Lemma 4 of [9], a sufficient condition for 

l<3'irt|/Ai 

max — , = (Jpim/s) 

Is that mm.k^K,j^LE[qj{zi)‘^WlYYGE[\qj{zi)\^\Wkif]~^/^ = 0(1) and 

log(iLL) = o(n^/^). These conditions follow from Assumption 6. In ad¬ 
dition, ~ Assumption 6. This gives the hrst 

part of statement (vi). The second part follows in the same manner. 

Statement (vi): 

max||6(ITfc;/)||i ^ max yif||6(IT'fc; I)|| ^ max yjf|(Q(/)'Q(/)/n)"iQ(/)'ITfc/n|| 

^ \\Q"^k/^/n\\oo/Vn = Op{^K'^s4>/n) 

□ 

Lemma 4. The following bounds hold. 

(i) \\W'3^jW/n\\ = Op{^/K^^/^f^) 

(ii) \\m'.y^fm/n\\ = Op{K(f>/n) 
(iii) Wm'J^fW/n\\ = Op{y/K'^cp/n) 

(iv) \\m'/^/n\\ = Op{\J()/n^/nK~^ + 

(v) \\W'^jH/^\\ = Op(C o(K)/^ vW^+/^^ + K-"/g^) 

(vi) \\W'^/n\\ = Op{-\/K(j)^/n) 

(vii) \\m'^j€/^/n\\ = Op{^/K’^cfP/n) 

Proof. Bounds for statement (i): 

\\W'^fW/nf = {Wl^jWijnf = 

k,KK 

= Y ^ Y \\KWk;T)/V^\\l\\Q'Wi/V^\\l 

k,KK k,KK 

Y\\KWk-M\l/n] (iig'wv^iiL 

ks^K j \k^K 

= Op{K^s(l)/n'^)Op{(j)/s) = Op{K^<f^lr?) 

Where the last probability bounds follow from Lemma 3. 

Next, bounds for statement (ii): 

\\m'.y^jm/n\\^ = Yj ^ Y^ y/n\Y\\.y^frni/^/n\Y 

k,li:K k,li:K 

Y \\-^pnk/Vn\Y I =Op{Kcl)/nf 

k^K / 

where again the final probability bounds follow from Lemma 3. This implies 
that ||m'^ym/n|| = op(l) by K(j)/n —)• 0. 

Statement (iii): Let Rm = m — QL track approximation errors in the first 
stage. Then 

||m'.^jlT/n|| = ||m'lT/n — m',^ylT/n|| 

= \\TQ'W/n + (m' - r'Q')W/n - m'^fW/nW 
= \\R'mW/n + (T - r(J))'Q'W/n|| 

^ \\R'^w/n\\ + ||(r - r(/))'g'w/n|| 
Then the first term in the last line is bounded by 
$\|R_m'W/n\| = O_p(\zeta_0(K)\sqrt{K/n}\sqrt{\phi/n})$ while the second term has 

II(T - T{h))'Q'W/n\\^ = J^KTfc - Vk{Ik))'Q'Wi/nf 


k,l 


IL 


k,l 


Y,\\^k-rk{Tk)\\l\\Q'Wi/V^\\ 
j;i|rfc-rfc(4)||2) (j2\\Q'Wi/V^\\ 


2 

oo 




141 j;i|rfc-rfc(i)in i^WQ'Wi/V^W 


2 

oo 


= Op{Ks)Op{(j)/n)KOp{(j)/s) 

with the last assertion following from Lemma 3. This gives Statement (iii) 
and $\|m'\mathcal{M}_{\hat I}W/n\| = o_p(1)$. 

Statement (iv): 

\\FA'h-^m'^jG2/Vn\\ ^ ||T4'0-^|| max 11 ^f/Vn\\Vn\\^jH/y/n\\ 

k^K 


= Op{l)Op{^/^pJn)^/nOp{^/(j^ + K “) = op(l). 


Statement (v): 

\\W'^jH/V^\\ ^ \\{H-Q'rjyW/V^\\ + mH-,T)-r^yQ'W/V^\\ 

^ Op{Co^/KJn-s/yyjn) + '/Kmax \\b{H;T) - ??||i||Q'Wfc/v/n||oo 

k^K 

^ Op{Co^Kct)/n) + VKOp{y^ + K-°)Op{^/^s) 


Statement (vi): 

||W'^^/V^|| = || 6 (W;/)Q'e/V^II 

^ ^!^max ||6(lTfc;/)||i||Q'e/v/n||oo 

k 

= '/KOp{y/s4>/n)Op{y/ (p/s) 


Statement (vii): By reasoning similar to that for Lemma 4(iii), it is sufficient 
to bound $\|R_m'\epsilon/\sqrt{n}\| = O_p(\sqrt{\phi/n})$ and 

$\|(\hat\Gamma(\hat I) - \Gamma)'Q'\epsilon/\sqrt{n}\| \le \sqrt{K}\max_{k \le K}|(\hat\Gamma_k(\hat I) - \Gamma_k)'Q'\epsilon/\sqrt{n}|$ 

^ v^max||r(I) - r)||i||Q'e/Vn||oo 

k^K 

^ y/K max \/\T\ + iv:s||r(/) - r)||||(5'e/v/n||oo 

k ’ 

= Op{K^^)Op{^s) = Op{^kWH 

Then Statement (vii) follows and the proof of Lemma 4 is complete. 

□ 


Lemma 5. $\max_{i \le n} |\hat h(z_i) - h(z_i)| = o_p(1)$ 

Proof. Let T = I U supp(?7). Then maxj|/i(2:j) — h{zi)\ ^ maxj |/i(^;j) — 
qiziYrjl +maxj \h(zi) — q(ziyrj\. The first term has the bound max* \h{xi) — 
qixifrjl = Opiy^^/cfjn) by assumption. A bound on the second term is ob¬ 
tained by the following: 

uiayi\h{zi) - q{xi)''qY‘ = max |g(xj)'(i7 - 
i i 

sf max||g^(zi)f ||iy- ryf 
^ |T| maxmax|m(zj)p||ry - r/lp 

I 

^ Op{Ks) max max \qj izi)\‘^\\fi — ry|p 
i j^L 


Then 

||iy - r/ll = \\b{y - G;T) - y\\ = ||6(G; f) + b{H-, T) + 6(e; I) - b{G; T) - rj\\ 
^\\b{Hy)-rj\\ + \\b{e-,T)\\ + \\b{G-G-,T)\\ 

First note that \\b{H;I) — r/|| = Op{^J(f/n + K~°‘) by Lemma 3. Next, 

||6(e;/)|| ^ y^</>min(?)||<3'e/’T-lloo = Op{^/^)Op{l)\\Q'e/^/n\\^/^/n = 

Op{y/K(f>/n). Finally, 

||6(G - G-J)\\ ^ \\b{P{P - ^);I)|| + ||6(G - PP;T)\\. The right term is 
\\b{G — Pfj] 1)11 = Op{K~°‘). The left term is bounded by 

\\b{P{P - P)]T)\\ ^ (fraini\T\)~^VTm&x\Y^qj{zi)p{xiy{P - P)/n\ 

i 

^ max V |gj(zi)||p(xi)/n|||||(/3 - ^)|| 

3 ^ 

i 

= Op{l)Op{y/K s)C,q{K) 0p{y/K/n + K~°‘) maxy^ |m(^;j)|/n 

3 ^ 

I 

= op(l) 
by the rate condition in Assumption 9. This completes the argument that 
$\max_i |\hat h(z_i) - q(z_i)'\eta| = o_p(1)$ and Lemma 5 follows. □ 

Appendix D. Tables 

Table 1. Additively Separable Simulation Results: Low Dimensional Design, 
Average Derivative 

                                 n = 500                         n = 1000
                      Med. Bias        MAD      RP 5%  Med. Bias        MAD      RP 5%

A. High First Stage Signal/Noise, High Structural Signal/Noise
Post-Double              -0.043      0.074      0.068     -0.028      0.047      0.074
Post-Double Set          -0.039      0.072      0.078     -0.030      0.049      0.088
Post-Double Ext          -0.049      0.077      0.076     -0.030      0.048      0.076
Post-Double Set+Ext      -0.048      0.074      0.086     -0.030      0.048      0.088
Post-Single I            -0.109      0.110      0.216     -0.095      0.096      0.308
Post-Single II           -0.116      0.117      0.236     -0.118      0.118      0.426
Series I                 -0.009      0.072      0.076     -0.011      0.044      0.056
Series II                -0.020      0.078      0.100     -0.020      0.056      0.088
Oracle                    0.001      0.059      0.058     -0.001      0.043      0.054

B. High First Stage Signal/Noise, Low Structural Signal/Noise
Post-Double              -0.027      0.039      0.086     -0.017      0.025      0.074
Post-Double Set          -0.025      0.037      0.096     -0.017      0.026      0.078
Post-Double Ext          -0.028      0.039      0.092     -0.020      0.027      0.088
Post-Double Set+Ext      -0.030      0.040      0.102     -0.021      0.027      0.096
Post-Single I            -0.030      0.039      0.092     -0.024      0.029      0.108
Post-Single II           -0.033      0.041      0.102     -0.034      0.037      0.156
Series I                 -0.002      0.036      0.072     -0.003      0.022      0.060
Series II                -0.008      0.040      0.090     -0.008      0.026      0.066
Oracle                   -0.001      0.032      0.060     -0.002      0.022      0.048

C. Low First Stage Signal/Noise, High Structural Signal/Noise
Post-Double              -0.034      0.042      0.104     -0.027      0.032      0.120
Post-Double Set          -0.034      0.040      0.108     -0.028      0.032      0.150
Post-Double Ext          -0.038      0.044      0.108     -0.028      0.033      0.124
Post-Double Set+Ext      -0.037      0.042      0.120     -0.028      0.032      0.156
Post-Single I            -0.056      0.058      0.244     -0.035      0.038      0.182
Post-Single II           -0.116      0.116      0.670     -0.117      0.117      0.918
Series I                 -0.009      0.037      0.078     -0.011      0.023      0.060
Series II                -0.018      0.041      0.096     -0.018      0.030      0.094
Oracle                    0.001      0.030      0.060     -0.000      0.022      0.052

D. Low First Stage Signal/Noise, Low Structural Signal/Noise
Post-Double              -0.015      0.020      0.104     -0.008      0.013      0.078
Post-Double Set          -0.015      0.020      0.118     -0.008      0.013      0.090
Post-Double Ext          -0.015      0.021      0.114     -0.008      0.013      0.078
Post-Double Set+Ext      -0.015      0.021      0.126     -0.009      0.013      0.098
Post-Single I            -0.015      0.021      0.108     -0.009      0.014      0.082
Post-Single II           -0.032      0.033      0.260     -0.033      0.033      0.468
Series I                 -0.003      0.018      0.074     -0.003      0.012      0.066
Series II                -0.007      0.021      0.098     -0.006      0.013      0.092
Oracle                   -0.001      0.016      0.058     -0.000      0.011      0.050

Note: Results are based on 500 simulation replications. The table reports median bias (Med. 
Bias), median absolute deviation (MAD), and rejection frequency for a 5% level test (RP 5%) for 
nine different estimators of the average derivative: the Post-Double estimator proposed in 
this paper; the three variants of Post-Double (Post-Double Set, Post-Double Ext, 
Post-Double Set+Ext) discussed in Sections 6, 7; a post-model selection estimator (Post-Single I) 
based on selecting terms with Lasso on the reduced form equation only; a post-model selection 
estimator (Post-Single II) based on selecting terms using Lasso on the outcome equation; 
two estimators that use a relatively small number of series terms (Series I, 
Series II); and an infeasible estimator (Oracle) that is explicitly given the control function. 
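
The three summary statistics reported in these tables can be computed from simulation output as in the sketch below. It assumes that MAD is the median absolute deviation of the estimate from the true value, and that the 5% test rejects when the absolute t-statistic exceeds 1.96. Both readings, as well as the function and argument names, are assumptions made for illustration.

```python
from statistics import median

def summarize(estimates, std_errors, truth, crit=1.96):
    """Return (median bias, MAD, rejection frequency) across replications.

    estimates, std_errors: one entry per simulation replication.
    truth: the true value of the target parameter in the design.
    """
    errors = [b - truth for b in estimates]
    med_bias = median(errors)
    mad = median(abs(e) for e in errors)
    rejections = sum(abs(e) / se > crit for e, se in zip(errors, std_errors))
    return med_bias, mad, rejections / len(errors)

# Tiny made-up example, not output from the paper's simulations.
med_bias, mad, rp = summarize([0.9, 1.0, 1.3], [0.1, 0.1, 0.1], truth=1.0)
```

Under this reading, an estimator whose errors almost all share one sign has MAD close to the absolute median bias, which is the pattern in the Post-Single II rows.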















Table 2. Additively Separable Models Simulation Results: Low Dimensional 
Design, Evaluation at the Mean 

                                 n = 500                         n = 1000
                      Med. Bias        MAD      RP 5%  Med. Bias        MAD      RP 5%

A. High First Stage Signal/Noise, High Structural Signal/Noise
Post-Double              -0.064      0.182      0.074     -0.035      0.124      0.044
Post-Double Set          -0.082      0.152      0.074     -0.037      0.111      0.038
Post-Double Ext          -0.075      0.181      0.078     -0.040      0.124      0.044
Post-Double Set+Ext      -0.094      0.154      0.080     -0.041      0.110      0.038
Post-Single I            -0.177      0.218      0.128     -0.136      0.163      0.088
Post-Single II           -0.188      0.232      0.130     -0.175      0.189      0.118
Series I                 -0.018      0.193      0.070     -0.004      0.122      0.050
Series II                -0.036      0.212      0.094     -0.027      0.137      0.060
Oracle                   -0.010      0.181      0.058      0.011      0.118      0.040

B. High First Stage Signal/Noise, Low Structural Signal/Noise
Post-Double              -0.066      0.200      0.070     -0.052      0.122      0.048
Post-Double Set          -0.168      0.189      0.130     -0.065      0.111      0.054
Post-Double Ext          -0.082      0.202      0.074     -0.055      0.123      0.056
Post-Double Set+Ext      -0.171      0.188      0.144     -0.074      0.115      0.070
Post-Single I            -0.083      0.209      0.072     -0.063      0.130      0.054
Post-Single II           -0.098      0.212      0.078     -0.093      0.138      0.058
Series I                 -0.012      0.193      0.066      0.002      0.122      0.048
Series II                -0.021      0.212      0.094     -0.017      0.144      0.088
Oracle                    0.005      0.182      0.074     -0.005      0.119      0.050

C. Low First Stage Signal/Noise, High Structural Signal/Noise
Post-Double              -0.049      0.097      0.096     -0.036      0.068      0.060
Post-Double Set          -0.067      0.082      0.104     -0.040      0.060      0.068
Post-Double Ext          -0.052      0.097      0.098     -0.037      0.069      0.058
Post-Double Set+Ext      -0.070      0.083      0.114     -0.042      0.060      0.068
Post-Single I            -0.087      0.116      0.120     -0.050      0.074      0.074
Post-Single II           -0.187      0.189      0.286     -0.181      0.181      0.414
Series I                 -0.017      0.097      0.078     -0.010      0.064      0.042
Series II                -0.032      0.107      0.098     -0.027      0.072      0.056
Oracle                   -0.004      0.089      0.058      0.005      0.060      0.040

D. Low First Stage Signal/Noise, Low Structural Signal/Noise
Post-Double              -0.045      0.101      0.082     -0.020      0.064      0.048
Post-Double Set          -0.115      0.122      0.224     -0.042      0.063      0.072
Post-Double Ext          -0.047      0.102      0.082     -0.023      0.063      0.052
Post-Double Set+Ext      -0.120      0.123      0.232     -0.042      0.064      0.078
Post-Single I            -0.046      0.098      0.084     -0.025      0.064      0.054
Post-Single II           -0.099      0.122      0.128     -0.090      0.099      0.112
Series I                 -0.017      0.094      0.070     -0.005      0.062      0.054
Series II                -0.022      0.110      0.104     -0.014      0.074      0.086
Oracle                   -0.007      0.090      0.070     -0.002      0.058      0.046

Note: Results are based on 500 simulation replications. The table reports median bias (Med.
Bias), median absolute deviation (MAD) and rejection frequency for a 5% level test (RP 5%) for
nine different estimators of the function evaluated at the mean: the Post-Double proposed in
this paper; the three variants of Post-Double (Post-Double Set, Post-Double Ext,
Post-Double Set+Ext) discussed in Sections 6 and 7; a post-model selection estimator
(Post-Single I) based on selecting terms with Lasso on the reduced form equation only; a
post-model selection estimator (Post-Single II) based on selecting terms using Lasso on the
outcome equation; two estimators that use a relatively small number of series terms (Series I,
Series II); and an infeasible estimator (Oracle) that is explicitly given the control function.
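
The three summary columns described in the note can be computed from a set of replications along the following lines. This is a minimal sketch, not the paper's code. It assumes that MAD is the median absolute deviation of the estimate from the true value (consistent with rows in which MAD equals the absolute median bias) and that the 5% level test rejects when the absolute t-statistic exceeds 1.96. The function and argument names are hypothetical.

```python
import numpy as np


def summarize_replications(estimates, std_errors, true_value, critical_value=1.96):
    """Return (median bias, MAD, rejection frequency) over simulation replications.

    Assumptions (not taken from the paper): MAD is measured around the true
    value, and the test is a two-sided t-test of the true value.
    """
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)

    # Estimation error in each replication.
    errors = estimates - true_value

    med_bias = float(np.median(errors))
    mad = float(np.median(np.abs(errors)))

    # Share of replications in which the true value is rejected.
    t_stats = errors / std_errors
    rejection_freq = float(np.mean(np.abs(t_stats) > critical_value))

    return med_bias, mad, rejection_freq


if __name__ == "__main__":
    # Toy input: three replications of an estimator whose target is 1.0.
    print(summarize_replications([0.9, 1.0, 1.3], [0.1, 0.1, 0.1], 1.0))
```

A rejection frequency close to 0.05 indicates that the test has approximately correct size; values far above 0.05 indicate over-rejection.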
Table 3. Additively Separable Simulation Results: High
Dimensional Design, Average Derivative

                                 n = 500                       n = 1000
                       Med. Bias       MAD     RP 5% Med. Bias       MAD     RP 5%


A. High First Stage Signal/Noise, High Structural Signal/Noise

Post-Double               -0.010     0.060     0.044    -0.006     0.042     0.048
Post-Double Set           -0.008     0.060     0.042    -0.005     0.042     0.048
Post-Double Ext           -0.013     0.060     0.042    -0.006     0.043     0.052
Post-Double Set+Ext       -0.011     0.060     0.042    -0.005     0.043     0.046
Post-Single I             -0.056     0.071     0.096    -0.028     0.050     0.092
Post-Single II            -0.688     0.688     1.000    -0.692     0.692     1.000
Series I                  -0.008     0.146     0.386    -0.027     0.111     0.424
Series II                 -0.419     0.424     0.782    -0.458     0.458     0.872
Oracle                     0.002     0.033     0.042    -0.001     0.025     0.052


B. High First Stage Signal/Noise, Low Structural Signal/Noise

Post-Double               -0.007     0.031     0.040    -0.008     0.026     0.136
Post-Double Set           -0.007     0.030     0.052    -0.003     0.021     0.050
Post-Double Ext           -0.007     0.030     0.040    -0.009     0.026     0.136
Post-Double Set+Ext       -0.008     0.029     0.056    -0.003     0.022     0.052
Post-Single I             -0.014     0.032     0.056    -0.011     0.027     0.140
Post-Single II            -0.357     0.357     1.000    -0.356     0.356     1.000
Series I                  -0.002     0.072     0.390    -0.291     0.291     0.924
Series II                 -0.155     0.168     0.682    -0.313     0.313     0.988
Oracle                    -0.001     0.025     0.040    -0.001     0.020     0.058


C. Low First Stage Signal/Noise, High Structural Signal/Noise

Post-Double               -0.010     0.031     0.056    -0.005     0.021     0.050
Post-Double Set           -0.008     0.030     0.054    -0.003     0.021     0.052
Post-Double Ext           -0.011     0.030     0.062    -0.006     0.022     0.052
Post-Double Set+Ext       -0.010     0.030     0.056    -0.004     0.021     0.062
Post-Single I             -0.017     0.032     0.068    -0.009     0.022     0.062
Post-Single II            -0.689     0.689     1.000    -0.692     0.692     1.000
Series I                  -0.004     0.073     0.386    -0.022     0.060     0.456
Series II                 -0.414     0.414     0.858    -0.449     0.449     0.916
Oracle                     0.000     0.017     0.044    -0.001     0.013     0.054


D. Low First Stage Signal/Noise, Low Structural Signal/Noise

Post-Double               -0.004     0.015     0.044    -0.005     0.014     0.136
Post-Double Set           -0.004     0.015     0.052    -0.001     0.011     0.066
Post-Double Ext           -0.005     0.016     0.044    -0.005     0.013     0.138
Post-Double Set+Ext       -0.005     0.016     0.052    -0.002     0.011     0.064
Post-Single I             -0.005     0.016     0.044    -0.005     0.013     0.142
Post-Single II            -0.357     0.357     1.000    -0.356     0.356     1.000
Series I                  -0.001     0.036     0.398    -0.295     0.295     0.986
Series II                 -0.153     0.153     0.808    -0.317     0.317     0.994
Oracle                    -0.001     0.012     0.048    -0.000     0.010     0.078


Note: Results are based on 500 simulation replications. The table reports median bias (Med.
Bias), median absolute deviation (MAD) and rejection frequency for a 5% level test (RP 5%) for
nine different estimators of the function evaluated at the mean: the Post-Double proposed in
this paper; the three variants of Post-Double (Post-Double Set, Post-Double Ext,
Post-Double Set+Ext) discussed in Sections 6 and 7; a post-model selection estimator
(Post-Single I) based on selecting terms with Lasso on the reduced form equation only; a
post-model selection estimator (Post-Single II) based on selecting terms using Lasso on the
outcome equation; two estimators that use a relatively small number of series terms (Series I,
Series II); and an infeasible estimator (Oracle) that is explicitly given the control function.
Table 4. Additively Separable Simulation Results: High
Dimensional Design, Evaluation at the Mean

                                 n = 500                       n = 1000
                       Med. Bias       MAD     RP 5% Med. Bias       MAD     RP 5%


A. High First Stage Signal/Noise, High Structural Signal/Noise

Post-Double               -0.031     0.214     0.034     0.006     0.151     0.038
Post-Double Set           -0.102     0.182     0.066    -0.016     0.147     0.054
Post-Double Ext           -0.038     0.220     0.036     0.002     0.154     0.038
Post-Double Set+Ext       -0.113     0.185     0.066    -0.021     0.148     0.052
Post-Single I             -0.126     0.232     0.100    -0.051     0.148     0.054
Post-Single II            -1.654     1.654     1.000    -1.662     1.662     1.000
Series I                   0.066     0.546     0.440    -0.091     0.367     0.376
Series II                 -0.964     0.979     0.706    -1.030     1.030     0.812
Oracle                     0.014     0.192     0.054     0.015     0.131     0.054


B. High First Stage Signal/Noise, Low Structural Signal/Noise

Post-Double               -0.063     0.190     0.042    -0.032     0.158     0.136
Post-Double Set           -0.137     0.175     0.108    -0.048     0.124     0.062
Post-Double Ext           -0.066     0.190     0.044    -0.035     0.155     0.136
Post-Double Set+Ext       -0.139     0.175     0.118    -0.049     0.125     0.062
Post-Single I             -0.088     0.191     0.044    -0.047     0.160     0.136
Post-Single II            -1.207     1.207     0.982    -1.187     1.187     1.000
Series I                   0.007     0.482     0.442    -0.753     0.753     0.294
Series II                 -0.481     0.619     0.534    -0.760     0.760     0.354
Oracle                    -0.033     0.176     0.052    -0.000     0.135     0.100


C. Low First Stage Signal/Noise, High Structural Signal/Noise

Post-Double               -0.029     0.116     0.046    -0.002     0.076     0.042
Post-Double Set           -0.088     0.111     0.114    -0.018     0.076     0.066
Post-Double Ext           -0.031     0.116     0.048    -0.004     0.076     0.042
Post-Double Set+Ext       -0.094     0.115     0.118    -0.021     0.076     0.060
Post-Single I             -0.050     0.116     0.054    -0.013     0.077     0.042
Post-Single II            -1.674     1.674     1.000    -1.660     1.660     1.000
Series I                   0.026     0.275     0.442    -0.054     0.190     0.428
Series II                 -0.937     0.938     0.842    -1.055     1.055     0.890
Oracle                    -0.000     0.096     0.058     0.008     0.068     0.052


D. Low First Stage Signal/Noise, Low Structural Signal/Noise

Post-Double               -0.045     0.099     0.046    -0.018     0.078     0.138
Post-Double Set           -0.125     0.128     0.252    -0.042     0.070     0.092
Post-Double Ext           -0.046     0.099     0.046    -0.018     0.078     0.140
Post-Double Set+Ext       -0.127     0.129     0.252    -0.044     0.069     0.088
Post-Single I             -0.046     0.098     0.046    -0.021     0.078     0.142
Post-Single II            -1.201     1.201     1.000    -1.196     1.196     1.000
Series I                  -0.004     0.241     0.440    -0.752     0.752     0.836
Series II                 -0.475     0.495     0.658    -0.758     0.758     0.890
Oracle                    -0.024     0.088     0.052     0.004     0.070     0.098


Note: Results are based on 500 simulation replications. The table reports median bias (Med.
Bias), median absolute deviation (MAD) and rejection frequency for a 5% level test (RP 5%) for
nine different estimators of the function evaluated at the mean: the Post-Double proposed in
this paper; the three variants of Post-Double (Post-Double Set, Post-Double Ext,
Post-Double Set+Ext) discussed in Sections 6 and 7; a post-model selection estimator
(Post-Single I) based on selecting terms with Lasso on the reduced form equation only; a
post-model selection estimator (Post-Single II) based on selecting terms using Lasso on the
outcome equation; two estimators that use a relatively small number of series terms (Series I,
Series II); and an infeasible estimator (Oracle) that is explicitly given the control function.
Table 5. Average Effects of Export on HIV:
UNAIDS Incidence 


                                Ave. Derivative    95% Confidence Interval

1. Log Export Value (WDI)
   Baseline:                         0.015            [ -0.135  0.165 ]
   All:                              0.044            [ -0.032  0.121 ]
   Post-Double:                      0.013            [ -0.104  0.131 ]
   n = 720, K = 5, L = 78
   Selected vars: lag-incidence, t × incidence_0, sin(2πt/4) × incidence_0,
   cos(2πt/4) × gdp_0, sin(2πt/8) × gdp_0, incidence_0 × lag-incidence

1. Log Export Value (WDI)
   Baseline:                         0.008            [ -0.238  0.254 ]
   All:                              0.038            [ -0.052  0.128 ]
   Post-Double:                      0.038            [ -0.117  0.195 ]
   n = 747, K = 5, L = 78
   Selected vars: lag-incidence, t × incidence_0, sin(2πt/4) × incidence_0,
   cos(2πt/4) × Iregion 2, sin(2πt/8) × incidence_0, cos(2πt/8) × Iregion 2,
   incidence_0 × lag-incidence, pop × gdp_0

1. Log Export Value (WDI)
   Baseline:                         0.028            [ -0.108  0.166 ]
   All:                             -0.041            [ -0.131  0.048 ]
   Post-Double:                      0.059            [ -0.053  0.172 ]
   n = 747, K = 5, L = 78
   Selected vars: lag-incidence, t × incidence_0, sin(2πt/4) × incidence_0,
   sin(2πt/8) × Iregion 2, incidence_0 × lag-incidence, pop × pop_0


The table presents estimates of the sample weighted average derivative of growth in HIV
incidence with respect to growth in export. The estimates are calculated using a baseline model,
using post-double selection over the set of conditioning variables, and over the entire set of
conditioning variables. All variables are generated from data given in [30]. n gives the number of
total observations (not observational units). Estimates are based on a K-order Hermite
polynomial in the export measure. L gives the number of conditioning variables.
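
To make the reported quantity concrete, the following minimal sketch computes a sample weighted average derivative from a series regression on Hermite terms with additively entering controls. It is an illustration under stated assumptions, not the paper's implementation: it uses the probabilists' Hermite family from NumPy, fits by ordinary least squares, takes the conditioning variables as already selected, and defaults to uniform weights. The function name and arguments are hypothetical.

```python
import numpy as np
from numpy.polynomial import hermite_e as H


def average_derivative(y, x, Z, K, weights=None):
    """Weighted average of dg/dx for the fit y = g(x) + Z @ beta.

    g is a degree-K series in probabilists' Hermite polynomials (an assumed
    choice of family). Z should not contain a constant column, because the
    zero-order Hermite term already acts as the intercept.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    Z = np.asarray(Z, dtype=float).reshape(len(x), -1)

    # Columns He_0(x), ..., He_K(x), followed by the conditioning variables.
    design = np.column_stack([H.hermevander(x, K), Z])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)

    # Differentiate the fitted series in x and evaluate at the sample points.
    series_coef = coef[: K + 1]
    derivative = H.hermeval(x, H.hermeder(series_coef))

    w = np.ones_like(x) if weights is None else np.asarray(weights, dtype=float)
    return float(np.sum(w * derivative) / np.sum(w))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    Z = rng.normal(size=(200, 2))
    y = 2.0 * x + 0.5 * x**2 + Z @ np.array([1.0, -1.0])
    print(average_derivative(y, x, Z, K=3))
```

In the Baseline, All, and Post-Double rows, only the set of columns passed as Z would differ; the derivative step is the same.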
Table 6. Average Effects of Export on HIV:
Death-based Incidence


                                Ave. Derivative    95% Confidence Interval

1. Log Export Value (WDI)
   Baseline:                         0.915            [  0.294  1.536 ]
   All:                              0.239            [ -0.755  1.235 ]
   Post-Double:                      0.904            [  0.348  1.461 ]
   n = 161, K = 3, L = 78
   Selected vars: sin(2πt/8) × x_0

2. Log Export Value (NBER)
   Baseline:                         0.733            [  0.274  1.191 ]
   All:                              0.614            [  0.330  0.899 ]
   Post-Double:                      0.646            [  0.112  1.180 ]
   n = 166, K = 3, L = 78
   Selected vars: × export_0, cos(2πt/4) × pop, sin(2πt/8) × pop_0,
   cos(2πt/8) × pop_0, cos(2πt/8) × pop

3. Log Export Volume (NBER)
   Baseline:                         0.405            [ -0.354  1.165 ]
   All:                              0.398            [ -0.015  0.811 ]
   Post-Double:                      0.366            [ -0.347  1.080 ]
   n = 166, K = 3, L = 78
   Selected vars: × export_0, cos(2πt/8) × pop_0

The table presents estimates of the sample weighted average derivative of growth in HIV
incidence with respect to growth in export. The estimates are calculated using a baseline model,
using post-double selection over the set of conditioning variables, and over the entire set of
conditioning variables. All variables are generated from data given in [30]. n gives the number of
total observations (not observational units). Estimates are based on a K-order Hermite
polynomial in the export measure. L gives the number of conditioning variables.